Abstract
In this paper we study the problem of approximately releasing the cut function of a graph

while preserving differential privacy, and give new algorithms (and new analyses of existing 
algorithms) in both the interactive and non-interactive settings. 

Our algorithms in the interactive setting are achieved by revisiting the problem of releasing 
differentially private, approximate answers to a large number of queries on a database. We 
show that several algorithms for this problem fall into the same basic framework, and are based 
on the existence of objects which we call iterative database construction algorithms. We give 
a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient)

interactive private query release mechanisms. Our modular analysis simplifies and tightens the 
analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm 
(and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan 
low-rank matrix decomposition. This new release mechanism gives an improvement on prior 
work in a range of parameters where the size of the database is comparable to the size of the 
data universe (such as releasing all cut queries on dense graphs). 

We also give a non-interactive algorithm for efficiently releasing private synthetic data for 
graph cuts with error $O(|V|^{1.5})$. Our algorithm is based on randomized response and a
non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm
due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that 
an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations 
would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. 
We leave finding such an algorithm as our main open problem. 
1 Introduction
Consider a graph representing the online communications between a set of individuals: each vertex
represents a user, and an edge between two users indicates that they have corresponded by email. 
It might be extremely useful to allow data analysts access to this graph in order to mine it for 
statistical information. However, the graph is also composed of sensitive information, and we 
cannot allow our released information to reveal much about the existence of specific edges. Thus 
we would like a way to analyze the structure of this graph while protecting the privacy of individual 
edges. Specifically we would like to be able to provide a promise of differential privacy [DMNS06]
(defined in Section 2), which, roughly, requires that our algorithms be randomized, and induce
nearly the same distribution over outcomes when given two data sets (e.g. graphs) which differ in 
only a single point (e.g. an edge). 

One natural objective is to provide private access to the cut function of this graph. That is, to 
provide a privacy preserving way for a data analyst to specify any two (of the exponentially many) 
subsets of individuals, and to discover (up to some error) the number of email correspondences 
that have passed between these two groups. There are two ways we might try to achieve this goal: 
We could give an interactive solution where we give the analyst private oracle access to the cut 
function. Here the user can write down any sequence of cut queries and the oracle will respond 
with private, approximate answers. We may also try for a stronger, non-interactive solution, in 
which we release a private synthetic dataset; a new, private graph that approximately preserves the 
cut function of the original graph. 

The case of answering cut queries on a graph is just one instance of the more general problem of 
query release for exponentially sized families of linear queries on a data set. Although this problem 
has been extensively studied in the differential privacy literature, we observe that no previously 
known efficient solution is suitable for the case of releasing all cut queries on graphs. In this paper 
we provide solutions to this problem in both the interactive and non-interactive settings. 

We give a generic framework that converts objects that we call iterative database construction 
(IDC) algorithms into private query release mechanisms in both the interactive and non-interactive 
settings. This framework generalizes the median mechanism [RR10], the online multiplicative
weights mechanism [HR10], and the offline multiplicative weights mechanism [GHRU11, HLM11].
Our framework gives a simple, modular analysis of all of these mechanisms, which leads to tighter
bounds in the interactive setting than those given in [RR10] and [HR10]. These improved bounds
are crucial to our objective of giving non-trivial approximations to all possible cut queries. We also 
instantiate this framework with a new IDC algorithm for arbitrary linear queries that is based on 
the Frieze/Kannan low-rank matrix decomposition [FK99a] and is tailored to releasing cut queries. 
This algorithm leads to a new online query release mechanism for linear queries that gives a better 
approximation in settings (such as we would encounter trying to answer all cut queries on a dense 
graph) where the database size is comparable to the size of the data universe. We summarize our 
bounds in Table 1.

We also give a new algorithm (building on techniques for constructing private synthetic data 
in [BCD+07, DNR+09]) in the non-interactive setting that efficiently generates private synthetic
graphs that approximately preserve the cut function. Finally, we use our IDC framework to show 
that an efficient, private algorithm for the problem of privately computing good rank-1 approxi- 
mations to symmetric matrices would automatically yield efficient private algorithms for releasing 
synthetic graphs with improved approximation guarantees. 

1.1 Our Results and Techniques 

Our main conceptual contribution is to define the abstraction of iterative database construction 
algorithms (Section 3) and to show that an efficient IDC for any class of queries $\mathcal{Q}$ automatically
yields an efficient private data release mechanism for Q in both the interactive and non-interactive 
settings. Informally, IDCs construct a data structure that can be used to answer all the queries 
in Q by iteratively improving a hypothesis data structure. Moreover, they update the hypothesis 
when given a query witnessing a significant difference between the hypothesis data structure and 
the underlying database. 

In hindsight, this framework generalizes the median mechanism [RR10] and the subsequent 
Table 1: Comparison of accuracy bounds for linear queries. The bounds in the first column are
prior to this work, the second column are what we achieve in this work, and the last column are the
new bounds instantiated for releasing all cut queries. The bounds listed here are approximate and
hide the dependence on certain parameters, such as $\delta$ and $\beta$. $n$ denotes database size, $k$ denotes
the total number of queries answered, and $\mathcal{X}$ represents the data universe. For a graph $G = (V, E)$,
$|\mathcal{X}| = \binom{|V|}{2}$, and for all cut queries, $k = 2^{2|V|}$. Previous efficient results do not
achieve non-trivial ($< |E|$) error, while all of the new bounds do for sufficiently dense graphs.

a The bounds listed here are for linear queries. The Median Mechanism more generally works for any set of
low sensitivity queries $\mathcal{Q}$ that have an $\alpha$-net of size $N_\alpha(\mathcal{Q})$. We improve the bound from the solution to $\alpha$ =

lo g (iVo(Q))lo B 2 (a) tQ the solution tQ a = yiogAKQlog* 

b Here we use $n_2 = \|\mathcal{D}\|_2^2$, in contrast to other known IDCs, whose error is in terms of $n = \|\mathcal{D}\|_1$. Note that
$n \le n_2 \le n^2$.

c For $k \le |\mathcal{X}|/2$. This is an approximate bound on average per-query error. All other algorithms listed bound
worst-case per-query error.



refinement for linear queries, the online multiplicative weights mechanism [HR10]. It also generalizes
the offline multiplicative weights mechanism [GHRU11, HLM11]. All of these mechanisms can be
seen to use IDCs of the sort we define in this work. (In Appendix A we show how these algorithms
fall into the IDC framework.) 

Our generalization and abstraction also allows for a simple, modular analysis of mechanisms 
based on IDCs. Using this analysis, we are able to show improved bounds on the accuracy of both 
the median mechanism and multiplicative weights mechanism. These improved bounds are critical 
to our application to releasing all cut queries. For these parameters, the previous bounds would
not guarantee error that is $< |E|$, meaning that the error may be larger than the largest cut in the
graph. Of course, we can privately guarantee error $\le |E|$ simply by releasing the answer 0 for every
cut query. Our new analysis shows that these mechanisms are capable of answering all cut queries
with error $o(|E|)$ for sufficiently dense graphs.

We also define a new IDC based on the Frieze/Kannan low-rank matrix decomposition [FK99a],
which yields a private interactive mechanism for releasing linear queries. Our new mechanism 
outperforms previously known techniques when the size of the database is comparable to the size 
of the data universe, as is the case on a dense graph. 

We then consider the problem of efficiently releasing private synthetic data for the class of cut 
queries. We show that a technique based on randomized response efficiently yields a private data 
structure (but not a synthetic database) capable of answering any cut query on a graph with $|V|$
vertices up to maximum error $O(|V|^{1.5})$. We then show how to use this data structure to efficiently
construct a synthetic database with only a constant factor blowup in our error. Our algorithm is
based on a technique for constructing synthetic data in [BCD+07, DNR+09]. Their observation
is that, for linear queries, the set of accurate synthetic databases is described by a (large) set of 



linear constraints. In the case of cut queries, we are able to use a constant-factor approximation to 
the cut-norm due to Alon and Naor [AN06] as the separation oracle to find a feasible solution (and
thus a synthetic database) efficiently. Finally, we show how the existence of an efficient private 
algorithm for finding good low-rank approximations to matrices would imply the existence of an 
improved algorithm for privately releasing synthetic data for cut queries, using our IDC framework. 
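
As a rough illustration of the randomized-response step described above, the following Python sketch perturbs every potential edge independently and then debiases the reported bits. It assumes the textbook form of randomized response (report the true bit with probability $e^{\varepsilon}/(1+e^{\varepsilon})$, the flipped bit otherwise); the exact variant used in the paper is not spelled out in this section, and the names `randomized_response_graph` and `estimate_cut` are hypothetical.

```python
import math
import random


def randomized_response_graph(adj, epsilon, rng):
    """Return a debiased, noisy copy of a symmetric 0/1 adjacency matrix.

    Assumption: textbook randomized response on each potential edge.
    """
    n = len(adj)
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    estimate = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            reported = bit if rng.random() < p_keep else 1 - bit
            # E[reported] = (1 - p_keep) + bit * (2 * p_keep - 1), so invert it.
            unbiased = (reported - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)
            estimate[i][j] = estimate[j][i] = unbiased
    return estimate


def estimate_cut(estimate, S, T):
    """Answer the (S, T) cut query on the noisy data structure."""
    return sum(estimate[i][j] for i in S for j in T)
```

The output is a real-valued matrix rather than a graph, which matches the remark above that this step yields a data structure but not yet a synthetic database.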

1.2 Related Work 



Differential privacy was introduced in a series of papers [BDMN05, CDM+05, DMNS06] in the
last decade, and has become a standard solution concept for statistical database privacy. The
first mechanism for simultaneously releasing the answers to exponentially large classes of statistical
queries was given in [BLR08]. They showed that the existence of small nets for a class of queries $\mathcal{Q}$
automatically yields a (computationally inefficient) non-interactive, private algorithm for releasing
answers to all the queries in $\mathcal{Q}$ with low error. Subsequent improvements were given by Dwork et
al. [DNR+09, DRV10].

Roth and Roughgarden [RR10] showed that large classes of queries could also be released 
with low error in the interactive setting, in which queries may arrive online, and the mechanism 
must provide answers before knowing which queries will arrive in the future. Subsequently, Hardt 
and Rothblum [HR10] gave improved bounds for the online query release problem based on the 
multiplicative weights algorithm. In hindsight, both of these algorithms follow the same basic 
framework, which is to use an IDC. 

Gupta et al. [GHRU11] gave a non-interactive data release mechanism based on the multiplica- 
tive weights algorithm and an arbitrary agnostic learner for a class of queries. An instantiation 
of this algorithm (the offline multiplicative weights algorithm) using the generic agnostic learner 
of Kasiviswanathan et al. [KLN+08] (who use the exponential mechanism of [MT07]) was
implemented and experimentally evaluated on the task of releasing small conjunctions to low error on
real data by Hardt, Ligett, and McSherry [HLM11]. This algorithm gives bounds comparable to
those given in this paper, but it does not work in the interactive setting, and is not computationally 
efficient for settings in which the number of queries is exponentially larger than the database size 
(as is the case with graph cuts). We note in Section 7 that this generic algorithm can also be
instantiated with any iterative database construction algorithm. 

Hardt and Talwar [HT10] consider the setting where the number of queries is smaller than the
universe size. When the number of queries is comparable to the universe size (i.e. $|\mathcal{Q}| = \Omega(|\mathcal{X}|)$),
their K-Norm mechanism gives average error that is smaller than the worst-case error promised 
by the online multiplicative weights mechanism when the database size is n > O I ,J ,L, I . This is 

the same range of parameters for which the Frieze/Kannan IDC algorithm improves on the online- 
multiplicative weights, and in this range of parameters, it achieves roughly the same error as the 
K-norm mechanism. In general, the bounds for the two mechanisms are incomparable: e.g., [HT10]
has a better, logarithmic dependence on $|\mathcal{X}|$, compared to the polynomial dependence for the
Frieze/Kannan IDC. On the other hand, the Frieze/Kannan IDC (and all algorithms in the IDC 
framework) have some advantages. Specifically, the bounds are for worst-case error, rather than 
average-case error; hold unconditionally, while the accuracy of the K-norm mechanism relies on 
the truth of the hyperplane conjecture; apply even when the number of queries is larger than the 
universe size; and typically have running time linear in $|\mathcal{X}|$, rather than $\mathrm{poly}(|\mathcal{X}|)$.

The Frieze-Kannan low-rank approximation (or the weak regularity lemma) shows that every 
matrix can be approximated by a sum of few cut matrices [FK99a, FK99b]; this fact has many
important algorithmic applications. We also use the fact that the proof extends to more general
settings, as was noted by [TTV09].

2 Preliminaries 

In this paper, we study datasets $\mathcal{D}$ that consist of collections of $n$ elements from some universe $\mathcal{X}$.
We can also write $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ when it is convenient to represent $\mathcal{D}$ as a histogram over $\mathcal{X}$. We say
that two databases $\mathcal{D}, \mathcal{D}'$ are adjacent if they differ in only a single element. As histograms, they
are adjacent if $\|\mathcal{D} - \mathcal{D}'\|_1 \le 1$. We will require that our algorithms satisfy differential privacy:

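Definition 2.1 (Differential Privacy). A randomized algorithm $M : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}$ (for any abstract
range $\mathcal{R}$) satisfies $(\varepsilon, \delta)$-differential privacy if for all adjacent databases $\mathcal{D}$ and $\mathcal{D}'$, and for all events
$S \subseteq \mathcal{R}$:

$$\Pr[M(\mathcal{D}) \in S] \le \exp(\varepsilon) \Pr[M(\mathcal{D}') \in S] + \delta.$$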

We will generally think of $\varepsilon$ as being a small constant, and $\delta$ as being negligibly small - i.e.
smaller than any inverse polynomial function of $n$.
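
For mechanisms with a finite output range, the inequality in the definition above can be checked directly. The sketch below is only an illustration: it relies on the standard fact that the worst-case event collects exactly the outcomes whose probability under one database exceeds $e^{\varepsilon}$ times its probability under the other. The function name `satisfies_dp`, the dictionary encoding of distributions, and the tolerance `tol` are assumptions made for this example.

```python
import math


def satisfies_dp(p, q, epsilon, delta, tol=1e-12):
    """Check the (epsilon, delta) inequality for two finite output distributions.

    p and q map each outcome to its probability under two adjacent databases.
    """
    outcomes = set(p) | set(q)

    def excess(a, b):
        # max over events S of a(S) - exp(epsilon) * b(S)
        return sum(
            max(0.0, a.get(s, 0.0) - math.exp(epsilon) * b.get(s, 0.0))
            for s in outcomes
        )

    return excess(p, q) <= delta + tol and excess(q, p) <= delta + tol


# One-bit randomized response that reports the truth with probability 3/4.
on_bit_1 = {1: 0.75, 0: 0.25}
on_bit_0 = {1: 0.25, 0: 0.75}
print(satisfies_dp(on_bit_1, on_bit_0, math.log(3), 0.0))
```

Both directions are checked because adjacency is symmetric.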

We note that when we discuss interactive mechanisms, we must view the output of a
mechanism as a transcript of an interaction between an adaptive adversary who supplies questions
about the database based on previous outcomes of the mechanism, and the mechanism itself. For
clarity, in this paper we will elide specifics about the model of adaptive private composition. For a
detailed treatment of this issue, see [DRV10].

A useful distribution is the Laplace distribution. 

Definition 2.2 (The Laplace Distribution). The Laplace Distribution (centered at 0) with scale $b$
is the distribution with probability density function $\mathrm{Lap}(x \mid b) = \frac{1}{2b}\exp(-|x|/b)$. We will sometimes
write $\mathrm{Lap}(b)$ to denote the Laplace distribution with scale $b$, and will sometimes abuse notation
and write $\mathrm{Lap}(b)$ simply to denote a random variable $X \sim \mathrm{Lap}(b)$.

A fundamental result in data privacy is that perturbing low sensitivity queries with Laplace
noise preserves $(\varepsilon, 0)$-differential privacy.

Theorem 2.3 ([DMNS06]). Suppose $Q : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$ is a function such that for all adjacent
databases $\mathcal{D}$ and $\mathcal{D}'$, $\|Q(\mathcal{D}) - Q(\mathcal{D}')\|_1 \le 1$. Then the procedure which on input $\mathcal{D}$ releases $Q(\mathcal{D}) +
(X_1, \ldots, X_k)$, where each $X_i$ is an independent draw from a $\mathrm{Lap}(1/\varepsilon)$ distribution, preserves $(\varepsilon, 0)$-
differential privacy.
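
The Laplace mechanism of Theorem 2.3 is short enough to sketch directly. The Python below is a minimal illustration, assuming the caller has already verified that the answer vector has $\ell_1$ sensitivity at most 1; the helper names `sample_laplace` and `laplace_mechanism` are chosen for this example, and the sampler uses standard inverse-CDF sampling.

```python
import math
import random


def sample_laplace(scale, rng):
    """Draw one sample from Lap(scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Clamp the argument of log to avoid log(0) at the boundary u = -0.5.
    magnitude = -scale * math.log(max(1.0 - 2.0 * abs(u), 1e-300))
    return math.copysign(magnitude, u)


def laplace_mechanism(true_answers, epsilon, rng):
    """Add independent Lap(1/epsilon) noise to every coordinate.

    Assumes the vector of true answers has L1 sensitivity at most 1.
    """
    scale = 1.0 / epsilon
    return [a + sample_laplace(scale, rng) for a in true_answers]
```

A call such as `laplace_mechanism([10.0, 20.0, 30.0], 0.5, random.Random(0))` returns three noisy answers, each perturbed by noise of scale 2.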

It will be useful to understand how privacy parameters for individual steps of an algorithm 
compose into privacy guarantees for the entire algorithm. The following useful theorem is due to 
Dwork, Rothblum, and Vadhan: 

Theorem 2.4 ([DRV10]). Let $0 < \varepsilon < 1$ be a parameter. Let $P, Q$ be probability measures supported
on a set $S$ such that $\max_{s \in S} |\log(P(s)/Q(s))| \le \varepsilon$. Then $\mathbb{E}_{s \sim P}[\log(P(s)/Q(s))] \le 2\varepsilon^2$.

We are interested in privately releasing accurate answers to large collections of queries. Queries
are functions $Q : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}$, and we denote collections of queries by $\mathcal{Q}$. We write $k = |\mathcal{Q}|$ to denote
the cardinality of the set of queries.

A common type of query is the linear query. A linear query $Q$ has a representation as a vector
in $[0, 1]^{|\mathcal{X}|}$, and can be evaluated on a database by taking the dot product between the query and the
histogram representation of the database: $Q(\mathcal{D}) = Q \cdot \mathcal{D}$.
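
To make the dot-product view concrete, here is a small Python illustration. The four-element universe and the particular query are made up for the example; only the evaluation rule $Q(\mathcal{D}) = Q \cdot \mathcal{D}$ comes from the text.

```python
def evaluate_linear_query(query, histogram):
    """Evaluate a linear query as the dot product of query and histogram."""
    if len(query) != len(histogram):
        raise ValueError("query and histogram must range over the same universe")
    return sum(q * d for q, d in zip(query, histogram))


# Hypothetical universe with four element types; the database holds six
# records: two of type 0, one of type 2, and three of type 3.
histogram = [2, 0, 1, 3]
# Counting query: how many records have type 0 or type 3?
query = [1, 0, 0, 1]
print(evaluate_linear_query(query, histogram))  # prints 5
```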



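Definition 2.5 (Accuracy). Let $\mathcal{Q}$ be a set of queries. A mechanism $M : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}$ is $(\alpha, \beta)$-
accurate for $\mathcal{Q}$ if there exists a function $\mathrm{Eval} : \mathcal{Q} \times \mathcal{R} \to \mathbb{R}$ s.t. for every database $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$, with
probability at least $1 - \beta$ over the coins of $M$, $M(\mathcal{D})$ outputs $r \in \mathcal{R}$ such that $\max_{Q \in \mathcal{Q}} |Q(\mathcal{D}) -
\mathrm{Eval}(Q, r)| \le \alpha$. We will abuse notation and write $Q(r) = \mathrm{Eval}(Q, r)$.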

We say that an algorithm $M$ releases synthetic data (as is the case for our new IDC, as well
as the multiplicative weights IDC [HR10]) if $\mathcal{R} = \mathbb{N}^{|\mathcal{X}|}$. In this case, $M(\mathcal{D}) = \mathcal{D}' \in \mathbb{N}^{|\mathcal{X}|}$ and
$\mathrm{Eval}(Q, \mathcal{D}') = Q(\mathcal{D}')$. We say that a synthetic data release algorithm is efficient if it runs in time
polynomial in $n = \|\mathcal{D}\|_1$, the size of the data set. Note that if $n \ll |\mathcal{X}|$, efficient algorithms will
have to input and output concise representations of the dataset (i.e., as collections of items from
the universe) instead of using the histogram representation. Nevertheless, it will be convenient to
think of datasets as histograms.

We say an algorithm efficiently releases $k$ queries from a class $\mathcal{Q}$ in the interactive setting if on
an arbitrary, adaptively chosen stream of queries $Q_1, \ldots, Q_k$, it outputs answers $a_1, \ldots, a_k$. The
algorithm must output each $a_i$ after receiving query $Q_i$ but before receiving $Q_{i+1}$, and is only allowed
$\mathrm{poly}(n)$ run time per query. We are typically interested in the case when $k$ can be exponentially
large in $n$. Note that as far as computational efficiency is concerned, releasing synthetic data for a
class of queries $\mathcal{Q}$ is at least as difficult as releasing queries from $\mathcal{Q}$ in the interactive setting, since
we can use the synthetic data to answer queries interactively.

Graphs and Cuts. When we consider datasets that represent graphs $G = (V, E)$, we think of the
database as being the edge set $\mathcal{D}_G = E$, and the data universe being the collection of all possible
edges in the complete graph: $|\mathcal{X}| = \binom{|V|}{2}$. That is, we consider the vertex set to be common among
all graphs, which differ only in their edge sets. One example we care about is approximating the
cut function of a private graph $G$.

For any real-valued matrix $A \in \mathbb{R}^{m \times m'}$, for $S \subseteq [m]$ and $T \subseteq [m']$, we define $A(S, T) :=
\sum_{s \in S, t \in T} A_{st}$. The cut norm of the matrix $A$ is now defined as $\|A\|_C := \max_{S \subseteq [m], T \subseteq [m']} |A(S, T)|$.
A graph $G$ can be represented as its adjacency matrix $A_G \in \{0, 1\}^{|V| \times |V|}$. In this paper, a cut in a
graph $G$ is defined by any two subsets of vertices $S, T \subseteq V$. We write the value of an $S, T$ cut in $G$ as
$G(S, T) := A_G(S, T)$, where $A_G$ is the adjacency matrix of $G$. Similarly, we extend the definition
of cut norm to $n$ vertex graphs naturally by defining $\|G\|_C := \|A_G\|_C = \max_{S, T \subseteq V} |G(S, T)|$
and $\|G - H\|_C := \|A_G - A_H\|_C$. The class of cut queries is $\mathcal{Q}_{cut} = \{Q_{S,T} : S, T \subseteq V\}$, where
$Q_{S,T}(G) = A_G(S, T)$. Note that cut queries are an example of a class of linear queries, because we
can represent them as a vector in which $Q_{S,T}[i, j] = 1$ if $i \in S$, $j \in T$ and 0 otherwise, and evaluate

$$Q_{S,T}(G) = \sum_{i, j \in V} Q_{S,T}[i, j] \cdot A_G[i, j].$$

Note that as linear queries, we can write cut queries as the outer product of two vectors: $Q_{S,T} =
\chi_S \cdot \chi_T^{\top}$, where $\chi_S, \chi_T \in \{0, 1\}^{|V|}$ are the characteristic vectors of the sets $S$ and $T$ respectively. Let
us define a more general class of rank-1 queries on graphs to be a subset of all linear queries:

$$\mathcal{Q}_{r1} = \{Q \in [0, 1]^{|V| \times |V|} \text{ such that } Q = u \cdot v^{\top} \text{ for some vectors } u, v \in [0, 1]^{|V|}\}$$

A rank-1 query is a linear query and can be evaluated

$$Q_{u,v}(G) = \sum_{i, j \in V} Q[i, j] A_G[i, j] = \sum_{i, j \in V} u[i] v[j] A_G[i, j].$$

Of course the set of rank-1 queries includes the set of cut queries, and any mechanism that is 
accurate with respect to rank-1 queries is also accurate with respect to cut queries. 
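
The relationship between cut queries and rank-1 queries can be checked on a small example. The Python sketch below is illustrative only: the 4-cycle graph and the helper names `cut_value`, `indicator`, and `rank1_query` are assumptions made here, while the two evaluation formulas follow the definitions above.

```python
def cut_value(adj, S, T):
    """G(S, T): sum of adjacency entries A_G[i][j] over i in S and j in T."""
    return sum(adj[i][j] for i in S for j in T)


def indicator(subset, n):
    """Characteristic vector of a vertex subset."""
    return [1 if i in subset else 0 for i in range(n)]


def rank1_query(adj, u, v):
    """Q_{u,v}(G): sum over i, j of u[i] * v[j] * A_G[i][j]."""
    n = len(adj)
    return sum(u[i] * v[j] * adj[i][j] for i in range(n) for j in range(n))


# A 4-cycle on vertices 0-1-2-3-0.
adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
S, T = {0, 1}, {2, 3}
print(cut_value(adj, S, T))                                # prints 2
print(rank1_query(adj, indicator(S, 4), indicator(T, 4)))  # prints 2
```

Using fractional vectors `u` and `v` gives rank-1 queries that are not cut queries, which is why accuracy for the rank-1 class is the stronger requirement.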



3 Iterative Database Constructions 

In this section we define the abstraction of iterative database constructions that includes our
new Frieze/Kannan construction and several existing algorithms [RR10, HR10] as special cases.
Roughly, each of these mechanisms works by maintaining a sequence of data structures $\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \ldots$
that give increasingly good approximations to the input database $\mathcal{D}$ (in a sense that depends on the
IDC). Moreover, these mechanisms produce the next data structure in the sequence by considering
only one query $Q$ that distinguishes the real database in the sense that $Q(\mathcal{D}^{(t)})$ differs significantly
from $Q(\mathcal{D})$.

Syntactically, we will consider functions of the form $\mathbf{U} : \mathcal{R}_{\mathbf{U}} \times \mathcal{Q} \times \mathbb{R} \to \mathcal{R}_{\mathbf{U}}$. The inputs
to $\mathbf{U}$ are a data structure in $\mathcal{R}_{\mathbf{U}}$, which represents the current data structure $\mathcal{D}^{(t)}$; a query $Q$,
which represents the distinguishing query, and may be restricted to a certain set $\mathcal{Q}$; and also a real
number, which estimates $Q(\mathcal{D})$. Formally, we define a database update sequence to capture the
sequence of inputs to $\mathbf{U}$ used to generate the database sequence $\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \ldots$.

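Definition 3.1 (Database Update Sequence). Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be any database and let
$\{(\mathcal{D}^{(t)}, Q^{(t)}, \hat{A}^{(t)})\}_{t = 1, \ldots, C} \in (\mathcal{R}_{\mathbf{U}} \times \mathcal{Q} \times \mathbb{R})^{C}$ be a sequence of tuples. We say the sequence is a
$(\mathbf{U}, \mathcal{D}, \mathcal{Q}, \alpha, C)$-database update sequence if it satisfies the following properties:

1. $\mathcal{D}^{(1)} = \mathbf{U}(\emptyset, \cdot, \cdot)$,

2. for every $t = 1, 2, \ldots, C$, $|Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})| \ge \alpha$,

3. for every $t = 1, 2, \ldots, C$, $|Q^{(t)}(\mathcal{D}) - \hat{A}^{(t)}| < \alpha$,

4. and for every $t = 1, 2, \ldots, C - 1$, $\mathcal{D}^{(t+1)} = \mathbf{U}(\mathcal{D}^{(t)}, Q^{(t)}, \hat{A}^{(t)})$.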



We note that for all of the iterative database constructions we consider, the approximate answer
$\hat{A}^{(t)}$ is used only to determine the sign of $Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})$, which is the motivation for requiring
that $\hat{A}^{(t)}$ have error smaller than $\alpha$. The main measure of efficiency we're interested in from an
iterative database construction is the maximum number of updates we need to perform before the
database $\mathcal{D}^{(t)}$ approximates $\mathcal{D}$ well with respect to the queries in $\mathcal{Q}$. To this end we define an
iterative database construction as follows:

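Definition 3.2 (Iterative Database Construction). Let $\mathbf{U} : \mathcal{R}_{\mathbf{U}} \times \mathcal{Q} \times \mathbb{R} \to \mathcal{R}_{\mathbf{U}}$ be an update rule
and let $B : \mathbb{R} \to \mathbb{R}$ be a function. We say $\mathbf{U}$ is a $B(\alpha)$-iterative database construction for query
class $\mathcal{Q}$ if for every database $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$, every $(\mathbf{U}, \mathcal{D}, \mathcal{Q}, \alpha, C)$-database update sequence satisfies
$C \le B(\alpha)$.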

Note that if $\mathbf{U}$ is a $B(\alpha)$-iterative database construction, then given any maximal
$(\mathbf{U}, \mathcal{D}, \mathcal{Q}, \alpha, C)$-database update sequence,
the final database $\mathcal{D}^{(C)}$ must satisfy $\max_{Q \in \mathcal{Q}} |Q(\mathcal{D}) - Q(\mathcal{D}^{(C)})| < \alpha$, or else there would exist
another query satisfying property 2 of Definition 3.1 and thus there would exist a $(\mathbf{U}, \mathcal{D}, \mathcal{Q}, \alpha, C+1)$-
database update sequence, contradicting maximality.
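
The following Python sketch shows what a database update sequence looks like for one concrete update rule. It is not the construction of this paper: it assumes the standard multiplicative-weights update on a normalized histogram (one of the existing mechanisms that the text says fits the IDC framework), feeds the update rule exact answers so that property 3 holds trivially, and uses an illustrative step size `eta = alpha / 2`. All function names are hypothetical.

```python
import math


def dot(query, histogram):
    return sum(q * h for q, h in zip(query, histogram))


def mw_update(hypothesis, query, estimate, eta):
    """One multiplicative-weights step on a normalized histogram.

    Only the sign of (estimate - current answer) is used, as noted above.
    """
    sign = 1.0 if estimate > dot(query, hypothesis) else -1.0
    weights = [h * math.exp(sign * eta * q) for q, h in zip(query, hypothesis)]
    total = sum(weights)
    return [w / total for w in weights]


def run_update_sequence(database, queries, alpha, max_updates):
    """Apply updates until no query distinguishes the hypothesis by >= alpha."""
    size = len(database)
    hypothesis = [1.0 / size] * size  # uniform starting hypothesis
    eta = alpha / 2.0                 # illustrative step size (an assumption)
    updates = 0
    while updates < max_updates:
        witness = next(
            (q for q in queries
             if abs(dot(q, database) - dot(q, hypothesis)) >= alpha),
            None,
        )
        if witness is None:
            break  # the sequence is maximal: every query is within alpha
        hypothesis = mw_update(hypothesis, witness, dot(witness, database), eta)
        updates += 1
    return hypothesis, updates
```

When the loop stops because no witness exists, the number of updates performed plays the role of $C$ in Definition 3.1, and a bound on it that holds for every database is what Definition 3.2 calls $B(\alpha)$.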

4 Query Release from Iterative Database Construction 

In this section we describe an interactive algorithm for releasing linear queries using an arbitrary 
iterative database construction. 



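Algorithm 1 Online Query Release Mechanism

$M^{\mathbf{U}}(\mathcal{D}, \varepsilon, \delta, \alpha, \beta, k)$:

Input: A database $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$, a parameter $\alpha \in \mathbb{R}$, parameters $\varepsilon, \delta, \beta \in [0, 1]$, and the number of
queries $k \in \mathbb{N}$. Oracle access to $\mathbf{U}$, a $B = B(\alpha)$-iterative database construction for $\mathcal{Q}$.
Parameters:

$$\sigma = \sigma(\alpha) := \frac{1000 \sqrt{B(\alpha)} \cdot \log(4/\delta)}{\varepsilon}, \qquad T = T(\alpha) := 4\sigma(\alpha) \cdot \log(2k/\beta).$$

Set $\mathcal{D}^{(1)} := \mathbf{U}(\emptyset, \cdot, \cdot)$, $C = 0$.
For: $t = 1, 2, \ldots, k$

1. Receive a query $Q^{(t)} \in \mathcal{Q}$ and compute the true answer $A^{(t)} = Q^{(t)}(\mathcal{D})$, the noisy answer $\hat{A}^{(t)} = A^{(t)} + Z^{(t)}$ with $Z^{(t)} \sim \mathrm{Lap}(\sigma)$, and the hypothesis answer $\Lambda^{(t)} = Q^{(t)}(\mathcal{D}^{(t)})$.

2. If: $|\hat{A}^{(t)} - \Lambda^{(t)}| \le T$ then: output $\Lambda^{(t)}$ and set $\mathcal{D}^{(t+1)} = \mathcal{D}^{(t)}$.

Else: output $\hat{A}^{(t)}$, set $\mathcal{D}^{(t+1)} = \mathbf{U}(\mathcal{D}^{(t)}, Q^{(t)}, \hat{A}^{(t)})$, and set $C = C + 1$.

3. If: $C = B(\alpha)$ then: terminate.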
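
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is one reading of the algorithm rather than a reference implementation: the noise scale `sigma`, the threshold `threshold`, and the update budget `max_updates` are passed in directly instead of being derived from $\varepsilon$, $\delta$, $\beta$ and $k$, queries are assumed to be linear, and `update_rule` stands for an arbitrary IDC supplied by the caller. All names are hypothetical.

```python
import math
import random


def sample_laplace(scale, rng):
    """Draw one sample from Lap(scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5
    magnitude = -scale * math.log(max(1.0 - 2.0 * abs(u), 1e-300))
    return math.copysign(magnitude, u)


def dot(query, histogram):
    return sum(q * h for q, h in zip(query, histogram))


def online_query_release(database, queries, update_rule, initial_hypothesis,
                         sigma, threshold, max_updates, rng):
    """Answer linear queries online, updating the hypothesis only when needed."""
    hypothesis = initial_hypothesis
    updates = 0
    answers = []
    for query in queries:
        noisy_answer = dot(query, database) + sample_laplace(sigma, rng)
        hypothesis_answer = dot(query, hypothesis)
        if abs(noisy_answer - hypothesis_answer) <= threshold:
            # The hypothesis already looks accurate: answer from it, no update.
            answers.append(hypothesis_answer)
        else:
            # Update round: release the noisy answer and improve the hypothesis.
            answers.append(noisy_answer)
            hypothesis = update_rule(hypothesis, query, noisy_answer)
            updates += 1
        if updates == max_updates:
            break  # the update budget B(alpha) is exhausted
    return answers, updates
```

Only update rounds release a value that depends directly on the noisy answer, which is why the privacy analysis below concentrates on counting and bounding those rounds.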

4.1 Privacy Analysis 

Theorem 4.1. Algorithm 1 is $(\varepsilon, \delta)$-differentially private.

Proof. Our privacy analysis follows the approach of [HR10]. Intuitively we will consider each
round of the mechanism individually, conditioned on the previous rounds, and classify each round
by the amount of "information leaked" from the database. We will use this classification, as well
as Azuma's Inequality, to bound the total amount of information leaked.
Consider the vector

$$v = \{v^{(t)}\}_{t=1}^{k}, \qquad v^{(t)} = \begin{cases} \hat{A}^{(t)} & \text{if } t \text{ was an update round,} \\ \bot & \text{otherwise.} \end{cases}$$

Observe that $v$ and the list of queries $Q^{(1)}, \ldots, Q^{(k)}$ are sufficient to reconstruct the internal state
of the mechanism, and thus its output, in each round. Therefore it will be sufficient to demonstrate
that a mechanism that releases $v$ is $(\varepsilon, \delta)$-differentially private.

Fix any two adjacent databases $\mathcal{D}_1$ and $\mathcal{D}_2$, and let $V_1$ and $V_2$ denote the distributions on the
vectors $v$ when $\mathcal{D}_1$ and $\mathcal{D}_2$ are the input database, respectively. Also fix a vector $v \in (\mathbb{R} \cup \bot)^k$.
We will use $v^{(<t)}$ to denote the first $t - 1$ entries of the vector $v$. We will analyze the following privacy
loss function for each possible output vector $v$:

$$\Psi(v) = \log\left(\frac{V_1(v)}{V_2(v)}\right) = \sum_{t=1}^{k} \log\left(\frac{V_1(v^{(t)} \mid v^{(<t)})}{V_2(v^{(t)} \mid v^{(<t)})}\right)$$

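In each round $t = 1, 2, \ldots, k$, we define three ranges for the value of the noise $Z^{(t)}$ that will
describe whether or not we were "never", "sometimes", or "always" going to do an update in
round $t$. Specifically, let $R^{(t)} = Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})$. Writing $A^{(t)} = Q^{(t)}(\mathcal{D})$ for the true answer,
$\Lambda^{(t)} = Q^{(t)}(\mathcal{D}^{(t)})$ for the answer on the current hypothesis, and $\hat{A}^{(t)} = A^{(t)} + Z^{(t)}$ for the noisy answer,
note that $R^{(t)} = A^{(t)} - \Lambda^{(t)}$ and that $R^{(t)} + Z^{(t)} = \hat{A}^{(t)} - \Lambda^{(t)}$. Now let

$$E_1^{(t)} = (-T - R^{(t)} + \sigma,\; T - R^{(t)} - \sigma)$$

$$E_2^{(t)} = [-T - R^{(t)} - \sigma,\; -T - R^{(t)} + \sigma] \cup [T - R^{(t)} - \sigma,\; T - R^{(t)} + \sigma]$$

$$E_3^{(t)} = (-\infty,\; -T - R^{(t)} - \sigma) \cup (T - R^{(t)} + \sigma,\; \infty)$$

1 We treat all the parameters of the mechanism, $\alpha, \beta, \varepsilon, \delta, k$, as well as the query sequence $Q^{(1)}, \ldots, Q^{(k)}$, as public
information.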



Intuitively, the event $E_1^{(t)}$ corresponds to values of the noise where $\hat{A}^{(t)} - \Lambda^{(t)}$ is sufficiently small
that switching databases could not cause an update. In these rounds, $v^{(t)} = \bot$ with probability
1 under both $V_1$ and $V_2$, so there is no privacy loss. The event $E_3^{(t)}$ corresponds to values of the
noise where $\hat{A}^{(t)} - \Lambda^{(t)}$ is sufficiently large that switching databases could not prevent an update.
These rounds do leak information about the database, but the update will increment $C$, and thus
there can only be $B(\alpha)$ such rounds. The rounds in which $E_2^{(t)}$ occurs are the problematic ones. In these rounds
we may not update and increment $C$, thus in principle there may be an arbitrary number of these
rounds. However, $\hat{A}^{(t)} - \Lambda^{(t)}$ may be close enough to the update threshold that switching from $\mathcal{D}_1$
to $\mathcal{D}_2$ would cause an update. Thus these rounds may incur privacy loss. The remainder of the
analysis relies on showing that there are not too many such rounds.

Now we make the following claims about the privacy loss in each type of round, based on the
properties of the Laplace distribution and the way in which we defined the events $E_1^{(t)}, E_2^{(t)}, E_3^{(t)}$.

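Claim 4.2. For every $u \in \mathbb{R} \cup \bot$ and every $t = 1, 2, \ldots, k$

$$\log\left(\frac{V_1(v^{(t)} = u \mid Z^{(t)} \in E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} = u \mid Z^{(t)} \in E_1^{(t)}, v^{(<t)})}\right) = 0.$$

Proof. Note that under both conditional measures, the probability of $v^{(t)} = \bot$ is 1. □

Claim 4.3. For every $u \in \mathbb{R} \cup \bot$ and every $t = 1, 2, \ldots, k$

$$\left|\log\left(\frac{V_1(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}\right)\right| \le 11\varepsilon_0 = \frac{11\varepsilon}{1000 \cdot \sqrt{B} \cdot \log(4/\delta)}.$$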



The proof of this claim requires a straightforward analysis of the event $u = \bot$ under both
conditional measures. To not interrupt the flow of the larger proof, we defer the details until later. 
The next claim states that the expected privacy loss is considerably smaller than the worst-case 
privacy loss. 



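Claim 4.4. For every $u \in \mathbb{R} \cup \bot$ and every $t = 1, 2, \ldots, k$

$$\mathbb{E}\left[\log\left(\frac{V_1(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}\right)\right] \le 242\varepsilon_0^2 = \frac{121\varepsilon^2}{500000 \cdot B \cdot \log^2(4/\delta)}$$

as long as $11\varepsilon_0 < 1$.

Proof. This claim follows from Claim 4.3 and Theorem 2.4. □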
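
The constants in Claim 4.4 can be checked by hand. The short derivation below applies Theorem 2.4 with parameter $11\varepsilon_0$ (the worst-case bound from Claim 4.3) and then substitutes $\varepsilon_0 = \varepsilon / (1000 \sqrt{B} \log(4/\delta))$; it is included only as a sanity check of the arithmetic.

```latex
\begin{align*}
2\,(11\varepsilon_0)^2
  &= 242\,\varepsilon_0^2 \\
  &= 242 \cdot \frac{\varepsilon^2}{10^6 \cdot B \cdot \log^2(4/\delta)} \\
  &= \frac{121\,\varepsilon^2}{500000 \cdot B \cdot \log^2(4/\delta)}.
\end{align*}
```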



In light of the previous claims, we want to bound the number of rounds in which $E_1^{(t)}$ does not
occur. Let $H = \{t \mid Z^{(t)} \notin E_1^{(t)}\}$.



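Claim 4.5. For every $t = 1, 2, \ldots, k$

$$\Pr\left[Z^{(t)} \in E_3^{(t)} \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)}\right] \ge 1/8.$$

Proof.

$$\Pr\left[Z^{(t)} \in E_3^{(t)} \mid Z^{(t)} \in E_2^{(t)} \cup E_3^{(t)}, v^{(<t)}\right] = \frac{\Pr\left[Z^{(t)} \in E_3^{(t)} \mid v^{(<t)}\right]}{\Pr\left[Z^{(t)} \in E_2^{(t)} \cup E_3^{(t)} \mid v^{(<t)}\right]}$$

$$= \frac{\int_{T - R^{(t)} + \sigma}^{\infty} \exp(-z/\sigma)\,dz}{\int_{T - R^{(t)} - \sigma}^{\infty} \exp(-z/\sigma)\,dz} = \frac{\exp(-(T - R^{(t)} + \sigma)/\sigma)}{\exp(-(T - R^{(t)} - \sigma)/\sigma)} = \exp(-2) \ge 1/8. \qquad □$$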
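
The last step of this proof is a ratio of two exponential tail integrals, which is easy to confirm numerically. The Python sketch below evaluates the closed form $\int_a^\infty e^{-z/\sigma}\,dz = \sigma e^{-a/\sigma}$ for a nonnegative lower limit; the function names and the sample values of the offset and of $\sigma$ are arbitrary choices for illustration.

```python
import math


def exponential_tail(lower, sigma):
    """Closed form of the integral of exp(-z / sigma) from lower to infinity."""
    return sigma * math.exp(-lower / sigma)


def update_probability_ratio(offset, sigma):
    """Ratio of the tail beyond offset + sigma to the tail beyond offset - sigma.

    Here offset plays the role of T - R^(t) in the proof above.
    """
    return exponential_tail(offset + sigma, sigma) / exponential_tail(offset - sigma, sigma)


ratio = update_probability_ratio(10.0, 2.0)
print(ratio >= 1 / 8)  # the ratio equals exp(-2), about 0.135
```

The ratio does not depend on the offset or on $\sigma$, which is why the bound holds uniformly over rounds.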



Claim 4.6. With probability $1 - \delta/2$, $|H| \le 16B \log(4/\delta)$.

Proof. Claim 4.5 implies that $\mathbb{E}[|H|] \le 8B$. Note that conditioned on the events of the previous
rounds, the events $Z^{(t)} \in E_3^{(t)}$ and $Z^{(t)} \in E_2^{(t)} \cup E_3^{(t)}$ depend only on the coin tosses used to generate
$Z^{(t)}$, which are independent of all of the other rounds. Thus we can show that the random variable
$|H|$ is dominated by a related random variable in which we do the following: In every round $t$
with $Z^{(t)} \in E_2^{(t)} \cup E_3^{(t)}$, flip a coin $c^{(t)}$ such that $\Pr[c^{(t)} = 1] = 1/8$. Let $H'$ be defined identically
to $H$ but in the process where we terminate the algorithm only when $\sum_{t} c^{(t)} = B$, rather than
the actual termination condition $C = B$. Since, by Claim 4.5, we know that the probability of
$Z^{(t)} \in E_3^{(t)}$ conditioned on $Z^{(t)} \in E_2^{(t)} \cup E_3^{(t)}$ is at least 1/8, we can couple these processes to ensure
that $c^{(t)} = 1 \implies Z^{(t)} \in E_3^{(t)}$. Thus, our new process will terminate no sooner than the actual
algorithm for every choice of random coins, and $|H'|$ dominates $|H|$ in CDF.

Now it suffices to show that $|H'| \le 2\mathbb{E}[|H'|] \log(4/\delta) \le 16B \log(4/\delta)$ with probability at least
$1 - \delta/2$. By a Chernoff bound (see footnote 2), the probability that $\sum_{t \in H'} c^{(t)} > \frac{1}{16 \log(4/\delta)} |H'|$ is at least
$1 - \delta/2$. Thus with probability at least $1 - \delta/2$ we have $|H'| \le 16B \log(4/\delta)$. □
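
The dominating process in this proof is simple to simulate: flip a coin with success probability 1/8 in every round and stop after $B$ successes. The Python sketch below is only an empirical illustration of how loose the bound $16B \log(4/\delta)$ is for that process; the values of `b` and `delta` and the function names are arbitrary choices.

```python
import math
import random


def rounds_until_b_heads(b, p, rng):
    """Number of coin flips needed to see b heads when each flip has bias p."""
    flips = 0
    heads = 0
    while heads < b:
        flips += 1
        if rng.random() < p:
            heads += 1
    return flips


def claimed_bound(b, delta):
    """The high-probability bound 16 * B * log(4 / delta) from Claim 4.6."""
    return 16 * b * math.log(4 / delta)


rng = random.Random(1)
b, delta = 20, 0.05
trials = [rounds_until_b_heads(b, 1 / 8, rng) for _ in range(200)]
print(max(trials) <= claimed_bound(b, delta))
```

The expected number of flips is $8B$, matching the bound $\mathbb{E}[|H'|] \le 8B$ used in the proof.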

We now give a high-probability bound on the total privacy loss, conditioned on the event that
$|H| \le 16B \cdot \log(4/\delta)$.

Claim 4.7. If $|H| \le 16B \cdot \log(4/\delta)$ then

$$\Pr[|\Psi(v)| > \varepsilon] < \delta/2.$$



2 A form of the Chernoff bound states that for independent $\{0, 1\}$-random variables $X_1, \ldots, X_n$, with $X =
\sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}[X]$, $\Pr[X < (1 - \gamma)\mu] < \exp(-\mu\gamma^2/2)$. From this we deduce that for $\mu > 2\log(4/\gamma)$,
$\Pr[X < \mu/\log(1/\gamma)] < \gamma$.
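Proof. The expected total privacy loss is

$$\mathbb{E}[\Psi(v)] = \mathbb{E}\left[\sum_{t=1}^{k} \log\left(\frac{V_1(v^{(t)} \mid v^{(<t)})}{V_2(v^{(t)} \mid v^{(<t)})}\right)\right] = \sum_{t=1}^{k} \mathbb{E}\left[\log\left(\frac{V_1(v^{(t)} \mid v^{(<t)})}{V_2(v^{(t)} \mid v^{(<t)})}\right)\right]$$

$$\le \sum_{t=1}^{k} \left( \Pr\left[E_1^{(t)} \mid v^{(<t)}\right] \cdot \mathbb{E}\left[\log\left(\frac{V_1(v^{(t)} \mid E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} \mid E_1^{(t)}, v^{(<t)})}\right)\right] + \Pr\left[\neg E_1^{(t)} \mid v^{(<t)}\right] \cdot \mathbb{E}\left[\log\left(\frac{V_1(v^{(t)} \mid \neg E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} \mid \neg E_1^{(t)}, v^{(<t)})}\right)\right] \right) \qquad (1)$$

$$\le \sum_{t=1}^{k} \Pr\left[\neg E_1^{(t)} \mid v^{(<t)}\right] \cdot 242\varepsilon_0^2 \le 1936 B \varepsilon_0^2 = \frac{121\varepsilon^2}{62500 \cdot \log^2(4/\delta)} \le \frac{\varepsilon}{2},$$

where (1) follows from the convexity of relative entropy, and the final inequality follows from
Claim 4.4 and the fact that $\mathbb{E}[|H|] \le 8B$.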

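Conditioning on the coins of the mechanism we have

$$\Psi(v) = \sum_{t \in H} \log\left(\frac{V_1(v^{(t)} \mid v^{(<t)})}{V_2(v^{(t)} \mid v^{(<t)})}\right)$$

and, by Claim 4.3, each term in the sum is at most $11\varepsilon_0$ in absolute value. Thus we can apply Azuma's
Inequality (see footnote 3) to $\Psi(v)$ to show

$$\Pr[|\Psi(v)| > \varepsilon] \le \Pr[|\Psi(v) - \mathbb{E}[\Psi(v)]| > \varepsilon/2] \le 2\exp\left(-\frac{(\varepsilon/2)^2}{2|H|(11\varepsilon_0)^2}\right).$$

If we condition on the event that $|H| \le 16B \log(4/\delta)$ then we have

$$\Pr[|\Psi(v)| > \varepsilon] \le 2\exp(-\log(4/\delta)) \le \delta/2,$$

which proves the claim. □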

Claims Ei~6l and I4T71 suffice to prove the Theorem. 

We now give a proof of Claim 14.31 
Claim 4.8 (Claim FOj restated). For every u £ R U _L and every t = 1, 2, ... ,k 



log 



y 2 ( v W =u\ Z® ££f } ,v«*) 



<lle 



lie 



1000 • VB ■ log(4/<5) 



D 
D 



³ Azuma's Inequality states that for a sequence of random variables $X_0, X_1, \dots, X_n$ such that $|X_i - X_{i-1}| \le \eta$ for $i = 1, 2, \dots, n$, $\Pr[|X_n - X_0| > \gamma] \le 2\exp\left(-\frac{\gamma^2}{2n\eta^2}\right)$.
Proof of Claim 4.3. First we will bound the left-hand side in the case of $u \in \mathbb{R}$.

$$
\left|\log\frac{V_1(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}{V_2(v^{(t)} = u \mid Z^{(t)} \notin E_1^{(t)}, v^{(<t)})}\right|
= \left|\log\frac{\exp(-|u - Q^{(t)}(\mathcal{D}_1)|/\sigma)}{\exp(-|u - Q^{(t)}(\mathcal{D}_2)|/\sigma)}\right|
\le \log(\exp(1/\sigma)) = 1/\sigma = \varepsilon_0 \qquad (2)
$$

where inequality (2) follows because the sensitivity of $Q^{(t)}$ is bounded above by 1. Now we will consider the case of $u = \bot$.



< 



log 



log 



log 



v x (v® = u \ z® <t e® y<^ 



I-R(t)-T +a exp(-\u\/a)du + JZ^tr-a exp(-|u|/er)du 



-R{t)-T+a 



-R(t)+T-1 



J-R(t)-T+i exp(-\u\/a)du + JL^+y^ exp(-\u\/a)du ^ 
JI^t)-T +a exp(-|n|/cr)dn fI^t)+T-o exp(-|n|/cr)du 



+ 



J-R(t)-T+i exp{-\u\/a)du J_ R(t)+T _ CT e3cp(-|«|/a)du 



log 2 + 



Lr{1)-t +1 exp(-|«|/<7)d« Lr(!)+t-i ex P( 



+ 



|w|/(r)du 



Ln(t)-T+i exp(-\u\/a)du Lb^t-I exp(-|«|/a)d« 



< 



< 



log 

2e 
cr — 



2 1 + 



<7 



1 



4e 
T < — < Heo 

1 (7 



□ 



4.2 Utility Analysis 

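Theorem 4.9. Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be any database, and let $U$ be a $B(\alpha)$-iterative database construction for query class $\mathcal{Q}$. Then for any $\beta, \varepsilon, \delta > 0$, Algorithm 1 is $\left(\frac{5T(\alpha)}{4}, \beta\right)$-accurate for $\mathcal{Q}$, as long as $T(\alpha) \in [4\alpha/3, 2\alpha]$.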

Proof. Roughly, the argument is as follows. Assume we did not add any noise to the queries. Then we would answer each query with the true answer $\hat{A}^{(t)}$, or with $A^{(t)}$ if $A^{(t)}$ is sufficiently close to $\hat{A}^{(t)}$. Thus the only reason the mechanism would fail to be accurate is if it performs too many updates and has to terminate due to the condition $C = B$. But since we only invoke $U$ when we find a query such that $|Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})|$ is large, we are actually generating a database update sequence, which cannot be too long if $U$ is an efficient iterative database construction. To formalize this intuition we have to consider the effect of the noise on this process and show that with high probability the noise remains in a small enough range that this intuition is indeed correct.

Fix any $\alpha, T(\alpha)$ such that $T(\alpha) \in [4\alpha/3, 2\alpha]$. For brevity, we use $T$ to denote $T(\alpha)$. First, we observe that, with probability $1 - \beta$,

$$\max_{t = 1, 2, \dots, k} |Z^{(t)}| \le T/4.$$

Indeed, by a direct calculation:

$$
\Pr\left[\max_{t = 1, 2, \dots, k} |Z^{(t)}| > T/4\right]
\le k \cdot \Pr\left[|Z^{(1)}| > T/4\right]
\le 2k\exp(-T/4\sigma)
= 2k\exp(-\log(2k/\beta)) \le \beta.
$$

For the rest of the proof we will condition on this event and show that for every $t = 1, 2, \dots, k$,

$$|Q^{(t)}(\mathcal{D}^{(t)}) - Q^{(t)}(\mathcal{D})| \le 5T/4.$$

Assuming the algorithm has not yet terminated, in step 3 we answer each query with either $A^{(t)}$ such that

$$
T \ge |\hat{A}^{(t)} - A^{(t)}| = |Q^{(t)}(\mathcal{D}) + Z^{(t)} - A^{(t)}|
\ge |Q^{(t)}(\mathcal{D}) - A^{(t)}| - |Z^{(t)}| \ge |Q^{(t)}(\mathcal{D}) - A^{(t)}| - T/4
$$

(in which case the error is at most $5T/4$); or else we answer directly with $\hat{A}^{(t)}$, in which case

$$|Q^{(t)}(\mathcal{D}) - \hat{A}^{(t)}| = |Z^{(t)}| \le T/4 \le 5T/4.$$

Now it suffices to show that Algorithm 1 does not prematurely terminate (due to the condition $C = B$) before answering every query, and in particular that the sequence of invocations of $U$ forms a $(U, \mathcal{D}, \mathcal{Q}, \alpha, C)$-database update sequence. Indeed, if this were the case, then we would be assured (by Definition 3.2) that after $B$ invocations of $U$, the resulting database $\mathcal{D}^*$ would be $(\alpha, \mathcal{Q})$-accurate. So in every subsequent round we would have

$$
|\hat{A}^{(t)} - A^{(t)}| = |Q^{(t)}(\mathcal{D}) + Z^{(t)} - Q^{(t)}(\mathcal{D}^{(t)})|
\le |Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})| + |Z^{(t)}| \le \alpha + T/4 \le T,
$$

and we would never make the $(B+1)$-st update. So to complete the proof, we show that we satisfy the properties in Definition 3.1. Firstly, in every round in which we invoke $U$,

$$
T < |\hat{A}^{(t)} - A^{(t)}| \le |Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})| + |Z^{(t)}| \le |Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})| + T/4
$$
$$\Longrightarrow\ |Q^{(t)}(\mathcal{D}) - Q^{(t)}(\mathcal{D}^{(t)})| \ge 3T/4 \ge \alpha,$$

so that the update sequence satisfies property 2 of Definition 3.1. Secondly, we have already seen that in every round

$$|\hat{A}^{(t)} - Q^{(t)}(\mathcal{D})| \le T/4 \le \alpha/2,$$

so that the update sequence satisfies property 3 of Definition 3.1. Properties 1 and 4 of Definition 3.1 follow by the construction of Algorithm 1. This completes the proof. □

In order to get the best accuracy parameters, one can solve the equation $\alpha = 3T(\alpha)/4$; substituting for $T(\cdot)$, this is the same as solving the following equation for $\alpha$:

$$\alpha = \frac{3000\sqrt{B(\alpha)}\,\log(4/\delta)\log(k/\beta)}{\varepsilon} \qquad (3)$$
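When the IDC bound has the form $B(\alpha) = b_0/\alpha^2$ (as it does for the subroutines considered in this paper), equation (3) can be solved in closed form. The following is a minimal sketch under that assumption; the function names and the numeric parameters are hypothetical and chosen only for illustration.

```python
import math

def solve_alpha(b0, eps, delta, k, beta):
    """Closed-form solution of equation (3) when B(alpha) = b0 / alpha**2.

    Substituting B(alpha) gives alpha**2 = 3000 * sqrt(b0) * L / eps,
    where L = log(4/delta) * log(k/beta).
    """
    log_terms = math.log(4.0 / delta) * math.log(k / beta)
    return math.sqrt(3000.0 * math.sqrt(b0) * log_terms / eps)

def equation_3_rhs(alpha, b0, eps, delta, k, beta):
    """Right-hand side of equation (3) evaluated at a candidate alpha."""
    bound = b0 / alpha ** 2  # B(alpha)
    log_terms = math.log(4.0 / delta) * math.log(k / beta)
    return 3000.0 * math.sqrt(bound) * log_terms / eps

# Hypothetical parameters, for illustration only.
n, universe, k = 10_000, 2 ** 20, 1_000
b0_mw = 4 * n ** 2 * math.log(universe)  # B(alpha) = 4 n^2 log|X| / alpha^2
alpha = solve_alpha(b0_mw, eps=1.0, delta=1e-6, k=k, beta=0.05)
```

Evaluating `equation_3_rhs` at the returned value reproduces `alpha`, which is a quick way to check that a chosen $B(\alpha)$ was substituted correctly.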
Corollary 4.10. The Multiplicative Weights mechanism is $(\varepsilon, \delta)$-differentially private and $(\alpha, \beta)$-accurate for:

$$\alpha = O\left(\frac{\sqrt{n}\,(\log|\mathcal{X}|)^{1/4}\sqrt{\log(4/\delta)\log(k/\beta)}}{\sqrt{\varepsilon}}\right)$$

The Median Mechanism is $(\varepsilon, \delta)$-differentially private and $(\alpha, \beta)$-accurate for:

$$\alpha = O\left(\frac{\sqrt{n}\,(\log|\mathcal{X}|\log k)^{1/4}\sqrt{\log(4/\delta)\log(k/\beta)}}{\sqrt{\varepsilon}}\right)$$

Proof. The multiplicative weights and median mechanism subroutines are given in Appendix A. By Theorem A.4 the multiplicative weights subroutine is a $B(\alpha)$-IDC for $B(\alpha) = 4n^2\log|\mathcal{X}|/\alpha^2$. By Theorem A.2 the median mechanism subroutine is a $B(\alpha)$-IDC for $B(\alpha) = n^2\log k\log|\mathcal{X}|/\alpha^2$.

The bounds then follow simply by solving for $\alpha$ in the expression $\alpha = 3000\sqrt{B(\alpha)}\log(4/\delta)\log(k/\beta)/\varepsilon$. □

Remark 4.11. We note that for the setting in which the database represents the edge set of a graph $G = (V, E)$, and the class of queries we are interested in is the set of all cut queries, this corresponds to an error bound of $O(\sqrt{|E||V|}\,\log^{1/4}|V| / \sqrt{\varepsilon})$.

5 An Iterative Database Construction Based on Frieze/Kannan

In this section we describe and analyze an iterative database construction based on the Frieze/Kannan "cut decomposition" [FK99a]. Although the style of analysis we use was originally applied specifically to cuts in [FK99a], we use a generalization of their argument to arbitrary linear queries. To our knowledge, such a generalization was first observed in [TTV09].

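Algorithm 2 The Frieze/Kannan-based IDC
$U^{FK}_{\alpha}(\mathcal{D}, Q, A)$:

If: $\mathcal{D} = \emptyset$ then: output $\mathcal{D}' = \mathbf{0}$

Else if: $Q(\mathcal{D}) - A > 0$ then: output $\mathcal{D}' = \mathcal{D} - \frac{\alpha}{|\mathcal{X}|} \cdot Q$

Else if: $Q(\mathcal{D}) - A < 0$ then: output $\mathcal{D}' = \mathcal{D} + \frac{\alpha}{|\mathcal{X}|} \cdot Q$

Note that the sum in Algorithm 2 denotes vector addition.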
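The update rule of Algorithm 2 can be sketched in a few lines. This is an illustrative reading only: the hypothesis and the query are represented as plain lists indexed by the universe, `None` stands in for the empty input, and the name `fk_update` is hypothetical.

```python
def fk_update(hyp, query, answer, alpha):
    """One Frieze/Kannan-style update of a hypothesis vector.

    hyp    -- current hypothesis as a list over the universe, or None at the start
    query  -- linear query as a list with entries in [0, 1]
    answer -- (possibly noisy) answer of the query on the private database
    """
    if hyp is None:
        return [0.0] * len(query)          # start from the all-zero vector
    step = alpha / len(query)              # alpha / |X|
    q_hyp = sum(q * h for q, h in zip(query, hyp))
    if q_hyp - answer > 0:                 # hypothesis answers too high: subtract
        return [h - step * q for h, q in zip(hyp, query)]
    if q_hyp - answer < 0:                 # hypothesis answers too low: add
        return [h + step * q for h, q in zip(hyp, query)]
    return list(hyp)                       # exact agreement: nothing to correct
```

The last branch (exact agreement) is not part of Algorithm 2; it is added here only so that the function is total.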

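Theorem 5.1. Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be a dataset. For any $\alpha > 0$, $U^{FK}_{\alpha}$ is a $B(\alpha)$-iterative database construction for a class of linear queries $\mathcal{Q}$, where $B(\alpha) = \frac{\|\mathcal{D}\|_2^2\,|\mathcal{X}|}{\alpha^2}$.

Proof. Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be any database and let $\{(\mathcal{D}^{(t)}, Q^{(t)}, A^{(t)})\}_{t = 1, \dots, C}$ be a $(U^{FK}_{\alpha}, \mathcal{D}, \mathcal{Q}, \alpha, C)$-database update sequence (Definition 3.1). We want to show that $C \le \|\mathcal{D}\|_2^2|\mathcal{X}|/\alpha^2$. Specifically, that after $\|\mathcal{D}\|_2^2|\mathcal{X}|/\alpha^2$ invocations of $U^{FK}_{\alpha}$, the database $\mathcal{D}^{(\|\mathcal{D}\|_2^2|\mathcal{X}|/\alpha^2)}$ is $(\alpha, \mathcal{Q})$-accurate for $\mathcal{D}$, and thus there cannot be a sequence of more than $\|\mathcal{D}\|_2^2|\mathcal{X}|/\alpha^2$ queries that satisfy property 2 of Definition 3.1.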
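In order to formalize this intuition, we use a potential argument as in [FK99a] to show that for every $t = 1, 2, \dots, B$, $\mathcal{D}^{(t+1)}$ is significantly closer to $\mathcal{D}$ than $\mathcal{D}^{(t)}$. Specifically, our potential function is the $L_2^2$ norm of the database $\mathcal{D} - \mathcal{D}^{(t)}$, defined as

$$\|\mathcal{D}\|_2^2 = \sum_{i \in \mathcal{X}} \mathcal{D}_i^2.$$

Observe that $\|\mathcal{D} - \mathcal{D}^{(0)}\|_2^2 = \|\mathcal{D}\|_2^2$ and $\|\mathcal{D} - \mathcal{D}^{(t)}\|_2^2 \ge 0$. Thus it suffices to show that in every step, the potential decreases by $\alpha^2/|\mathcal{X}|$. We analyze the case where $Q^{(t)}(\mathcal{D}^{(t)}) < A^{(t)}$; the analysis of the opposite case is similar. Let $R^{(t)} = \mathcal{D} - \mathcal{D}^{(t)}$. Observe that in this case we have

$$|Q^{(t)}(R^{(t)})| \ge \alpha$$

and

$$Q^{(t)}(R^{(t)}) \ge A^{(t)} - Q^{(t)}(\mathcal{D}^{(t)}) - \alpha/2 > -\alpha/2.$$

Thus we must have

$$Q^{(t)}(R^{(t)}) \ge \alpha.$$

Now we can analyze the drop in potential.

$$
\begin{aligned}
\|R^{(t)}\|_2^2 - \|R^{(t+1)}\|_2^2
&= \|R^{(t)}\|_2^2 - \|R^{(t)} - (\alpha/|\mathcal{X}|) \cdot Q^{(t)}\|_2^2 \\
&= \sum_{i \in \mathcal{X}} \left( R^{(t)}(i)^2 - \left(R^{(t)}(i) - (\alpha/|\mathcal{X}|) \cdot Q^{(t)}(i)\right)^2 \right) \\
&= \frac{2\alpha}{|\mathcal{X}|}\, Q^{(t)}(R^{(t)}) - \frac{\alpha^2}{|\mathcal{X}|^2} \sum_{i \in \mathcal{X}} Q^{(t)}(i)^2 \\
&\ge \frac{2\alpha^2}{|\mathcal{X}|} - \frac{\alpha^2}{|\mathcal{X}|} = \frac{\alpha^2}{|\mathcal{X}|}.
\end{aligned}
$$

This bounds the number of steps by $\|\mathcal{D}\|_2^2|\mathcal{X}|/\alpha^2$, and completes the proof. □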
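The potential argument can be checked numerically on a toy instance. The sketch below is an illustration, not part of the analysis: it uses exact answers $A^{(t)} = Q^{(t)}(\mathcal{D})$ (no noise), takes the query class to be all 0/1 vectors over a six-element universe, and records the potential after every update. All names and parameter values are assumptions made for the example.

```python
from itertools import product

def run_fk_updates(db, alpha):
    """Repeat Frieze/Kannan-style updates until every 0/1 query is within alpha."""
    size = len(db)
    queries = list(product([0, 1], repeat=size))   # all subset-sum queries
    step = alpha / size                            # alpha / |X|
    hyp = [0.0] * size
    potentials = [sum((d - h) ** 2 for d, h in zip(db, hyp))]
    while True:
        violated = None
        for q in queries:
            # gap = Q(D) - Q(D^t) for this query
            gap = sum(qi * (d - h) for qi, d, h in zip(q, db, hyp))
            if abs(gap) >= alpha:
                violated = (q, gap)
                break
        if violated is None:
            return hyp, potentials
        q, gap = violated
        sign = 1.0 if gap > 0 else -1.0            # move the hypothesis toward D
        hyp = [h + sign * step * qi for h, qi in zip(hyp, q)]
        potentials.append(sum((d - h) ** 2 for d, h in zip(db, hyp)))

db = [1, 0, 1, 1, 0, 0]
alpha = 0.5
hyp, potentials = run_fk_updates(db, alpha)
num_updates = len(potentials) - 1
bound = sum(d * d for d in db) * len(db) / alpha ** 2   # ||D||_2^2 |X| / alpha^2
```

On this instance the number of updates stays below `bound`, and consecutive entries of `potentials` differ by at least $\alpha^2/|\mathcal{X}|$, as the proof predicts.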

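Corollary 5.2. Algorithm 1 instantiated with $U^{FK}_{\alpha}$ for $T = O\left(\varepsilon^{-1/2}(n_2|\mathcal{X}|)^{1/4}\sqrt{\log(k/\beta)}\right)$ is $(\varepsilon, \delta)$-differentially private and an $(\alpha, \beta)$-accurate interactive release mechanism for query set $\mathcal{Q}$ with

$$\alpha = O\left(\frac{(n_2|\mathcal{X}|)^{1/4}\sqrt{\log(k/\beta)\log(1/\delta)}}{\sqrt{\varepsilon}}\right),$$

where $n_2 = \|\mathcal{D}\|_2^2$. Note that for databases that are subsets of the data universe (rather than multisets), $n_2 = n$.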

Remark 5.3. We note that for the setting in which the database represents the edge set of a graph $G = (V, E)$, and the class of queries we are interested in is the set of all cut queries, this bound corresponds to $O(|V||E|^{1/4}/\sqrt{\varepsilon})$. This is an improvement on the bound given by the multiplicative weights IDC for dense graphs: when $|E| \ge \Omega(|V|^2/\log|V|)$.
6 Results for Synthetic Data

In this section, we consider the more demanding task of efficiently releasing synthetic data for the class of cut queries on graphs. The task at hand here is to actually generate another graph that approximates the private graph with respect to cuts. Such a graph can then simply be released to data analysts, who can examine it at their leisure. This is preferable to the interactive setting, in which an actual graph is never produced, and a central stateful API must be maintained to handle queries as they come in from data analysts. Our algorithm is simple, and is based on releasing a noisy histogram. Note that for a graph, $|\mathcal{X}| = \binom{|V|}{2}$ and $\mathcal{D} = E$, so as long as $|E| = \Omega(|V|)$, the universe is at most a polynomial in the database size. (Moreover, it is easy to show that there does not exist any $(\varepsilon, 0)$-private mechanism that has error $o(|V|)$, so the only interesting cases are when $|E| = \Omega(|V|)$.)

Consider a database whose elements are drawn from $\mathcal{X}$; we represent this as a vector (histogram) $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$. Let $\hat{\mathcal{D}} = \mathcal{D} + (Y_1, \dots, Y_{|\mathcal{X}|})$ be a "noisy" database, where each $Y_i \sim \mathrm{Lap}(1/\varepsilon)$ is an independent draw from the Laplace distribution. Note that by Theorem 2.3, the procedure which on input $\mathcal{D}$ releases the noisy database $\hat{\mathcal{D}}$ preserves $(\varepsilon, 0)$-differential privacy. This follows because the histogram vector can be viewed as simply the evaluation of the identity query $Q : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{N}^{|\mathcal{X}|}$, which can be easily seen to be 1-sensitive. At this stage, we could release $\hat{\mathcal{D}}$ and be satisfied that we have designed a private algorithm. There are two issues: first, we must analyze the utility guarantees that $\hat{\mathcal{D}}$ has with respect to our query set $\mathcal{Q}$. Second, $\hat{\mathcal{D}}$ is not quite synthetic data. It will be a vector with possibly negative entries, and so does not represent a histogram. Interpreted as a graph, it will be a weighted graph with negative edge weights. This may not be satisfactory, so we must do a little more work.
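As a concrete illustration of the noisy histogram, the following sketch adds independent $\mathrm{Lap}(1/\varepsilon)$ noise to each coordinate. It relies only on the Python standard library and samples a Laplace variable as the difference of two independent exponentials with the same rate; the helper names are hypothetical.

```python
import random

def sample_laplace(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_histogram(hist, eps, rng):
    """Return hist with independent Lap(1/eps) noise added to every coordinate."""
    return [count + sample_laplace(1.0 / eps, rng) for count in hist]

rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = noisy_histogram([3, 0, 1, 5], eps=0.5, rng=rng)
```

The output generally contains non-integer and possibly negative entries, which is exactly why it does not yet qualify as synthetic data.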

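The utility guarantee of this procedure over the collections $\mathcal{Q}$ of linear queries is also not difficult; i.e., each query $Q \in \mathcal{Q}$ is a vector in $[0,1]^{|\mathcal{X}|}$, and on any database $\mathcal{D}$ evaluates to $Q(\mathcal{D}) = \langle Q, \mathcal{D} \rangle$.

Lemma 6.1. Suppose that $\mathcal{Q} \subseteq [0,1]^{|\mathcal{X}|}$ is some collection of linear queries. For the case $|\mathcal{Q}| \le (\beta/2)\,2^{|\mathcal{X}|/6}$, it holds that with probability at least $1 - \beta$,

$$|Q(\mathcal{D}) - Q(\hat{\mathcal{D}})| \le \varepsilon^{-1}\sqrt{6|\mathcal{X}|\log(|\mathcal{Q}|/\beta)}$$

for every query $Q \in \mathcal{Q}$. For general $\mathcal{Q}$, the error bound is $O\left(\varepsilon^{-1}\sqrt{|\mathcal{X}|\log(|\mathcal{X}|/\beta)\log(|\mathcal{Q}|/\beta)}\right)$.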

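Proof. Note that $Q(\hat{\mathcal{D}}) - Q(\mathcal{D}) = \langle Q, \hat{\mathcal{D}} - \mathcal{D} \rangle = \sum_i q_i Y_i$, where each random variable $Y_i \sim \mathrm{Lap}(1/\varepsilon)$, and $q_i \in [0,1]$ is the $i$-th coordinate of query $Q$. By a tail bound for sums of Laplace random variables (Theorem 6.5), we know that

$$\Pr\left[\left|\sum_{i=1}^{|\mathcal{X}|} q_i Y_i\right| > \alpha\right] \le 2\exp\left(-\alpha^2\varepsilon^2/6|\mathcal{X}|\right),$$

as long as $\alpha \le |\mathcal{X}|/\varepsilon$. If we set $\alpha = \varepsilon^{-1}\sqrt{12|\mathcal{X}|\log(2|\mathcal{Q}|/\beta)}$, the probability bound on the right-hand side is at most $\beta$, and the condition $\alpha \le |\mathcal{X}|/\varepsilon$ translates to $|\mathcal{Q}| \le (\beta/2)\,2^{|\mathcal{X}|/6}$.

The proof for the general case, where we do not assume a bound on the size $|\mathcal{Q}|$, loses an extra factor of $O(\sqrt{\log(|\mathcal{X}|/\beta)})$. Indeed, with probability at least $1 - \beta/2$, each of the absolute values $|Y_i|$ is at most $L = O(1/\varepsilon)\log(|\mathcal{X}|/\beta)$. Now, conditioning on this event happening, the sum $\sum_{i} q_i Y_i$ behaves like a sum of $|\mathcal{X}|$-many independent $[-L, L]$-bounded random variables with mean 0: in this case, $\Pr[|\sum_i q_i Y_i| > \alpha] \le 2e^{-\Omega(\alpha^2/(L^2|\mathcal{X}|))} = e^{-\Omega(\alpha^2\varepsilon^2/(|\mathcal{X}|\log(|\mathcal{X}|/\beta)))}$ by a standard Chernoff bound. Now setting $\alpha = O\left(\varepsilon^{-1}\sqrt{|\mathcal{X}|\log(|\mathcal{X}|/\beta)\log(|\mathcal{Q}|/\beta)}\right)$ causes this probability to be at most $\beta/2$; by a union bound, the probability of large deviations is at most $\beta$. □

In summary, note that the bounds on the error are $\approx \varepsilon^{-1}\sqrt{|\mathcal{X}|\log|\mathcal{Q}|}$, with some correction terms depending on whether the size of the query set is at most $2^{O(|\mathcal{X}|)}$ or larger.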
6.1 Randomized Response and Synthetic Data for Cut Queries

For the case of cuts in a graph on a vertex set $V$, the database is a vector in $\{0,1\}^{\binom{|V|}{2}}$, and the noisy database just adds independent $\mathrm{Lap}(1/\varepsilon)$ noise to each bit value. Since the query set $\mathcal{Q}_{cuts}$ has size $2^{2|V|}$ (namely, it consists of all $(S, T)$ pairs), we have $|\mathcal{Q}_{cuts}| \ll (\beta/2)\,2^{|\mathcal{X}|/6}$ for all reasonable $\beta$ and $|V|$, and we can use the randomized response analysis above to get accuracy

$$O\left(\left(\tbinom{|V|}{2}\log(|\mathcal{Q}_{cuts}|/\beta)\right)^{1/2}/\varepsilon\right) = O\left((|V|^{3/2} + |V|\log(1/\beta))/\varepsilon\right)$$

with probability at least $1 - \beta$. In fact, one can give a slightly tighter analysis where the accuracy depends on the size of the sets $S, T$: by observing that the number of random variables participating in a cut query $(S, T)$ is exactly $|S||T|$, one can show that the accuracy for all cuts is whp $O(\varepsilon^{-1}\sqrt{|V||S||T|})$.
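To make the observation about $|S||T|$ participating variables concrete, here is a small sketch that evaluates one cut query on a noisy graph. It assumes that $S$ and $T$ are disjoint vertex sets and stores one weight per unordered pair of vertices; the toy graph, the names, and the seed are illustrative choices only.

```python
import random
from itertools import combinations

def cut_value(weights, S, T):
    """Total weight of pairs with one endpoint in S and the other in T (S, T disjoint)."""
    return sum(weights[frozenset((u, v))] for u in S for v in T)

rng = random.Random(0)
eps = 1.0
vertices = range(5)
edges = {frozenset(e) for e in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]}

# 0/1 database over all unordered pairs, plus independent Lap(1/eps) noise per pair.
true_w = {frozenset(p): (1.0 if frozenset(p) in edges else 0.0)
          for p in combinations(vertices, 2)}
noise = {p: rng.expovariate(eps) - rng.expovariate(eps) for p in true_w}
noisy_w = {p: true_w[p] + noise[p] for p in true_w}

S, T = [0, 3], [1, 2, 4]
error = cut_value(noisy_w, S, T) - cut_value(true_w, S, T)
```

Here `error` is the sum of exactly $|S||T| = 6$ independent noise terms, one for each pair crossing from $S$ to $T$.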

Viewing the noisy database $\hat{\mathcal{D}}$ as a weighted graph $\hat{G}$, where the weight of $(u, v)$ is $\mathbf{1}_{(u,v) \in E(G)} + \mathrm{Lap}(1/\varepsilon)$, note that $\hat{G}$ has negative weight edges and hence cannot be considered synthetic data. We can remedy the situation (using the idea of solving a suitable linear program [BCD+07, DNR+09]):

Lemma 6.2 (Synthetic Data for Cuts). There is a computationally efficient $(\varepsilon, 0)$-differentially private randomized algorithm that takes an unweighted graph $G$ and outputs a synthetic graph $G'$ such that, with high probability, $\|G - G'\|_C \le O(|V|^{3/2}/\varepsilon)$; that is, all cuts in $G$ and $G'$ are within $O(|V|^{3/2}/\varepsilon)$ additive error.

Proof. First we construct the noisy datastructure $\hat{\mathcal{D}}$ by perturbing each entry of $\mathcal{D}$ with independent noise drawn from $\mathrm{Lap}(1/\varepsilon)$. All further operations will be conducted on $\hat{\mathcal{D}}$, and so the entire algorithm will be $(\varepsilon, 0)$-differentially private. Let $z_{ij}$ denote the $(i,j)$-th entry of $\hat{\mathcal{D}}$, i.e. $z_{ij} = \mathcal{D}[i,j] + \mathrm{Lap}(1/\varepsilon)$. Let us condition on the event that for every cut $(S, T)$, the additive error bound is $O(|V|^{3/2}/\varepsilon)$. Now define the following LP:

$$
\begin{aligned}
\min\ & \lambda \\
\text{such that}\ & \sum_{ij \in S \times T} (x_{ij} - z_{ij}) \le \lambda && \forall S, T \\
& \sum_{ij \in S \times T} (z_{ij} - x_{ij}) \le \lambda && \forall S, T \\
& x_{ij} \in [0, 1].
\end{aligned}
$$

There exists a feasible solution to this LP with $\lambda = O(|V|^{3/2}/\varepsilon)$, since we can just use the original graph to get the solution $x_{ij} = \mathbf{1}_{(i,j) \in E(G)}$. Now if we solve the LP, and output the optimal feasible solution to the LP, it would be a weighted graph $G'$ such that $\|G - G'\|_C \le O(|V|^{3/2}/\varepsilon)$.

Since the LP has exponentially many constraints, it remains to show how to solve the LP. Define the matrix $A$ with $A_{ij} = x_{ij} - z_{ij}$, and define $A(S, T) := \sum_{i \in S, j \in T} A_{ij}$; then the separation oracle must find sets $S, T$ such that $|A(S, T)|$ is larger than $\lambda$. Equivalently, it suffices to approximately compute the cut norm of the matrix $A$. There is a constant-factor approximation algorithm of Alon and Naor for the cut norm problem [AN06]; using this we can solve the LP above to within constant factors of optimum. □

Remark 6.3. The procedure outlined above results in outputting a weighted graph (with non-negative edge weights) $\mathcal{D}' \in [0,1]^{|V| \times |V|}$. Note that if the original graph was unweighted, $\mathcal{D} \in \{0,1\}^{|V| \times |V|}$, and it is desired to output another unweighted graph, we can simply randomly round $\mathcal{D}'$ to an integral solution in the obvious way. This does not incur any asymptotic loss in the stated accuracy bound.
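One natural reading of "the obvious way" is independent randomized rounding, in which each pair is kept with probability equal to its fractional weight. The sketch below shows that reading; it is an assumption about the intended rounding, and the function name is hypothetical.

```python
import random

def round_to_unweighted(weights, rng):
    """Keep each pair independently with probability equal to its weight in [0, 1]."""
    return {pair for pair, w in weights.items() if rng.random() < w}

rng = random.Random(0)
fractional = {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 0.3}
rounded = round_to_unweighted(fractional, rng)
```

Pairs with weight 1 are always kept and pairs with weight 0 are never kept, so the rounding preserves every cut value in expectation.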

6.2 A Spectral Solution, and Rank-1 Queries 

The Alon-Naor algorithm involves solving SDPs, which are computationally intensive, but we can avoid that by using the tighter accuracy bound of $O(\varepsilon^{-1}\sqrt{|V||S||T|})$ we proved. Consider the modified LP:

$$
\begin{aligned}
\min\ & \lambda \\
\text{such that}\ & \sum_{ij \in S \times T} (x_{ij} - z_{ij}) \le \lambda\sqrt{|S||T|} && \forall S, T \\
& \sum_{ij \in S \times T} (z_{ij} - x_{ij}) \le \lambda\sqrt{|S||T|} && \forall S, T \\
& x_{ij} \in [0, 1].
\end{aligned}
$$

Again, this LP has a feasible solution (whp) with $\lambda = O(\sqrt{|V|}/\varepsilon)$. Solving the LP to within a factor of $\rho$, and outputting a near-optimal feasible solution to the LP, would give a synthetic weighted graph $G'$ such that $\|G - G'\|_C \le O(\rho \cdot \sqrt{|V||S||T|}/\varepsilon) = O(\rho \cdot |V|^{3/2}/\varepsilon)$. To this end, define the normalized cut norm as

$$\|A\|_{NC} := \max_{S, T} \frac{|A(S, T)|}{\sqrt{|S||T|}}.$$

Now the separation problem is to find $S, T$ approximately maximizing the normalized cut norm. For this we use a theorem of Nikiforov [Nik09] which says that if $\sigma_1(A)$ is the top singular value of $A$ (and $\|A\|_2$ is $A$'s spectral norm) then

$$\|A\|_{NC} \le \sigma_1(A) = \|A\|_2 \le \|A\|_{NC} \cdot O(\log|V|). \qquad (4)$$

There is also a polynomial-time algorithm that, given the top singular value/vector for $A$, returns a normalized cut $(S', T')$ of value $|A(S', T')|/\sqrt{|S'||T'|} \ge \sigma_1(A)/O(\log|V|) \ge \|A\|_{NC}/O(\log|V|)$. Using this as a separation oracle we can solve the LP to within $\rho = O(\log|V|)$ of the optimum, and hence get an additive error of $O(\varepsilon^{-1}\log|V|\sqrt{|V||S||T|}) = O(\varepsilon^{-1}|V|^{3/2}\log|V|)$.
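The first inequality in (4) can be observed directly on a small matrix. The sketch below estimates $\sigma_1(A)$ by power iteration and computes the normalized cut norm by exhaustive search over all non-empty $S, T$. The exhaustive search is only a stand-in for illustration and takes exponential time; it is not the polynomial-time rounding algorithm referred to above. The matrix and function names are arbitrary choices.

```python
import math
from itertools import combinations

def top_singular_value(A, iters=500):
    """Estimate sigma_1(A) by power iteration on A^T A."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / (j + 1) for j in range(cols)]   # arbitrary non-zero start vector
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]  # A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Av))   # ||A v||_2 for the unit vector v

def normalized_cut_norm(A):
    """Brute-force maximum over non-empty S, T of |A(S, T)| / sqrt(|S||T|)."""
    rows, cols = range(len(A)), range(len(A[0]))
    best = 0.0
    for s in range(1, len(A) + 1):
        for S in combinations(rows, s):
            for t in range(1, len(A[0]) + 1):
                for T in combinations(cols, t):
                    total = sum(A[i][j] for i in S for j in T)
                    best = max(best, abs(total) / math.sqrt(s * t))
    return best

A = [[1, -2, 0], [3, 1, -1], [0, 2, 2]]
sigma1 = top_singular_value(A)
nc = normalized_cut_norm(A)
```

For this matrix the normalized cut norm is attained, for example, by the single entry $A_{21} = 3$, and it stays below the estimated top singular value.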

Remark 6.4. We note that the theorem of Nikiforov quoted above [Nik09] also implies that the synthetic graph released by our algorithm is useful for the (infinite) set of rank-1 queries $\mathcal{Q}_{r1}$ as well as the set of cut queries, with only an $O(\log|V|)$ factor loss in the additive approximation for each query.

6.3 A Tail Bound for Laplace Distributions 

The following tail bound for Laplace random variables uses standard techniques; we give it here for completeness.

Theorem 6.5. Suppose $\{Y_i\}_{i=1}^{k}$ are i.i.d. $\mathrm{Lap}(b)$ random variables, and scalars $q_i \in [0,1]$. Define $Y := \sum_i q_i Y_i$. Then

$$
\Pr[Y \ge a] \le
\begin{cases}
\exp\left(-\dfrac{a^2}{6kb^2}\right), & \text{if } a \le kb \\[2ex]
\exp\left(-\dfrac{a}{6b}\right), & \text{if } a > kb
\end{cases}
\qquad (5)
$$
Proof. Suppose $Y_1, Y_2, \dots, Y_k$ are i.i.d. $\mathrm{Laplace}(b)$ random variables, and $q_1, q_2, \dots, q_k$ are scalars in $[0,1]$. We now give a tail bound for $Y := \sum_i q_i Y_i$. It is useful to recall that the moment generating function for a Laplace random variable $Y \sim \mathrm{Lap}(b)$ is $\mathrm{E}[e^{tY}] = 1/(1 - b^2t^2)$ for $|t| < 1/b$, and also that if $Y \sim \mathrm{Lap}(b)$, then $cY \sim \mathrm{Lap}(cb)$ for $c > 0$. Hence for $t \in [0, 1/b]$, we have

$$
\begin{aligned}
\Pr[Y \ge a] = \Pr[e^{tY} \ge e^{ta}] &\le \frac{\mathrm{E}[e^{tY}]}{e^{ta}}
= e^{-ta}\prod_{i} \mathrm{E}[e^{t q_i Y_i}] = e^{-ta}\prod_{i} \left(1 - (q_i b t)^2\right)^{-1} \\
&= \exp\Big(-ta - \sum_i \log\left(1 - (q_i b t)^2\right)\Big) \\
&= \exp\Big(-ta + \sum_i \big((q_i b t)^2 + (q_i b t)^4/2 + (q_i b t)^6/3 + \cdots\big)\Big)
\end{aligned}
$$

where we used the Taylor series expansion (and hence need that $|q_i b t| < 1$). The last expression only worsens as the $q_i$'s increase, so the worst case is when all $q_i = 1$, when we get a bound of

$$
\begin{aligned}
& \exp\big(-ta + k((bt)^2 + (bt)^4/2 + (bt)^6/3 + \cdots)\big) \\
&\le \exp\big(-ta + k((bt)^2 + (bt)^4 + (bt)^6 + \cdots)\big) \\
&\le \exp\Big(-ta + k\,\frac{(bt)^2}{1 - (bt)^2}\Big). \qquad (6)
\end{aligned}
$$

Let us set $t := \frac{a}{2kb^2}$. Recall that we needed the condition that $t \in [0, 1/b]$, so let us assume that $a \le kb$. This implies that $tb = a/2kb \le 1/2$. Hence, plugging in this setting for $t$, and noting that $(1 - (tb)^2) \ge 3/4$, we get

$$\Pr[Y \ge a] \le \exp\Big(-\frac{a^2}{2kb^2} + k\,\frac{a^2}{(3/4)(2kb)^2}\Big) = \exp\Big(-\frac{a^2}{6kb^2}\Big).$$

This completes the proof for the case $a \le kb$. Now suppose $a > kb$; in that case let us set $t = 1/2b$. Substituting this into (6) gives us a tail bound of $\exp(-a/2b + k/3)$. And since $a > kb$, this is bounded by $\exp(-a/6b)$. This proves the theorem. □
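The two regimes of bound (5) are easy to evaluate numerically. The helper below is a direct transcription of the statement of Theorem 6.5; the function name is hypothetical.

```python
import math

def laplace_sum_tail_bound(a, k, b):
    """Upper bound from Theorem 6.5 on Pr[Y >= a], where Y is a sum of k terms q_i * Lap(b)."""
    if a <= k * b:
        return math.exp(-a * a / (6.0 * k * b * b))   # sub-Gaussian regime
    return math.exp(-a / (6.0 * b))                   # sub-exponential regime
```

At the switch point $a = kb$ both branches evaluate to $\exp(-k/6)$, so the bound is continuous in $a$.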

7 Towards Improving on Randomized Response for Synthetic Data 

In this section, we consider one possible avenue towards giving an efficient algorithm for privately 
generating synthetic data for graph cuts that improves over randomized response. We first show how 
generically, any efficient Iterative Database Construction algorithm can be used to give an efficient 
offline algorithm for privately releasing synthetic data when paired with an efficient distinguisher. 
The analysis here follows the analysis of [GHRU11], who analyzed the corresponding algorithm when
instantiated with the multiplicative weights algorithm, rather than a generic Iterative Database 
Construction algorithm. 

We will pair an Iterative Database Construction algorithm for a class of queries C with a 
corresponding distinguisher. 

Definition 7.1 ($(F(\varepsilon), \gamma)$-Private Distinguisher). Let $\mathcal{Q}$ be a set of queries, let $\gamma \ge 0$ and let $F(\varepsilon) : \mathbb{R}^+ \to \mathbb{R}$ be a function. An algorithm $\mathrm{Distinguish}_{\varepsilon} : \mathbb{N}^{|\mathcal{X}|} \times \mathbb{N}^{|\mathcal{X}|} \to \mathcal{Q}$ is an $(F(\varepsilon), \gamma)$-Private Distinguisher for $\mathcal{Q}$ if for every setting of the privacy parameter $\varepsilon$, it is $\varepsilon$-differentially private with respect to $\mathcal{D}$ and if for every $\mathcal{D}, \mathcal{D}' \in \mathbb{N}^{|\mathcal{X}|}$ it outputs a $Q^* \in \mathcal{Q}$ such that $|Q^*(\mathcal{D}) - Q^*(\mathcal{D}')| \ge \max_{Q \in \mathcal{Q}} |Q(\mathcal{D}) - Q(\mathcal{D}')| - F(\varepsilon)$ with probability at least $1 - \gamma$.

Note that in [GHRU11], we referred to a distinguisher as an agnostic learner. Indeed, a distinguisher is solving the agnostic learning problem for its corresponding set of queries. We here refer to it as a distinguisher to emphasize its applicability beyond the typical realm of learning (e.g. we here hope to apply a distinguisher to a graph cuts problem).

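Algorithm 3 The Iterative Construction (IC) Mechanism. It takes as input an $(F(\varepsilon), \gamma)$-Private Distinguisher $\mathrm{Distinguish}_{\varepsilon}$ for $\mathcal{Q}$, together with a $B(\alpha)$-iterative database construction algorithm $U_{\alpha}$ for $\mathcal{Q}$.

IC($\mathcal{D}, \varepsilon, \delta, \alpha$, Distinguish, $U$):

  Let $D^0 = U(\emptyset, \cdot, \cdot)$.
  Let $\varepsilon_0 = \varepsilon_0(\alpha) = \frac{\varepsilon}{4\sqrt{B(\alpha)\log(1/\delta)}}$.
  for $t = 1$ to $B(\alpha)$ do
    Let $Q^{(t)} = \mathrm{Distinguish}_{\varepsilon_0}(\mathcal{D}, D^{t-1})$
    Let $\hat{A}^{(t)} = Q^{(t)}(\mathcal{D}) + \mathrm{Lap}(1/\varepsilon_0)$
    if $|\hat{A}^{(t)} - Q^{(t)}(D^{t-1})| < 3\alpha/4$ then
      Output $\mathcal{D}' = D^{t-1}$.
    else
      Let $D^{t} = U(D^{t-1}, Q^{(t)}, \hat{A}^{(t)})$.
    end if
  end for
  Output $\mathcal{D}' = D^{B(\alpha)}$.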
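The control flow of Algorithm 3 can be sketched as follows. Several choices here are assumptions made for illustration only: the IDC is a Frieze/Kannan-style update run with parameter $\alpha/2$, the number of rounds is supplied by the caller, and the distinguisher is an exhaustive search that is NOT differentially private. It is a placeholder for the private distinguisher that the mechanism actually requires, so this sketch demonstrates the accuracy logic and not the privacy guarantee.

```python
import math
import random
from itertools import product

def fk_update(hyp, query, answer, alpha):
    """Frieze/Kannan-style step: move the hypothesis toward the reported answer."""
    step = alpha / len(query)
    q_hyp = sum(q * h for q, h in zip(query, hyp))
    sign = -1.0 if q_hyp - answer > 0 else 1.0
    return [h + sign * step * q for h, q in zip(hyp, query)]

def brute_force_distinguisher(db, hyp, queries):
    """NOT private: returns the query with the largest discrepancy (placeholder only)."""
    return max(queries, key=lambda q: abs(sum(qi * (d - h) for qi, d, h in zip(q, db, hyp))))

def ic_mechanism(db, eps, delta, alpha, num_rounds, queries, rng):
    """Sketch of the IC loop; num_rounds plays the role of B(alpha)."""
    eps0 = eps / (4.0 * math.sqrt(num_rounds * math.log(1.0 / delta)))
    hyp = [0.0] * len(db)                                        # D^0
    for _ in range(num_rounds):
        q = brute_force_distinguisher(db, hyp, queries)
        noise = rng.expovariate(eps0) - rng.expovariate(eps0)    # Lap(1/eps0)
        noisy_answer = sum(qi * d for qi, d in zip(q, db)) + noise
        q_hyp = sum(qi * h for qi, h in zip(q, hyp))
        if abs(noisy_answer - q_hyp) < 3.0 * alpha / 4.0:
            return hyp                                           # early output D^{t-1}
        hyp = fk_update(hyp, q, noisy_answer, alpha / 2.0)
    return hyp
```

With a very large `eps` the Laplace noise is negligible, and on a small example the loop exits early with a hypothesis whose worst-case query error is below $\alpha$.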



What follows is a formal analysis, but the intuition for the mechanism is simple: we simply run the iterative database construction algorithm to construct a hypothesis that approximately matches $\mathcal{D}$ with respect to the queries $C$. If our distinguisher succeeds in finding a query that has high discrepancy between the hypothesis database and the true database whenever one exists, then our IDC algorithm will output a database that is $\beta$-accurate with respect to $C$. This requires at most $T$ iterations, and so we access the data only $2T$ times using $(\varepsilon_0, 0)$-differentially private methods (running the given distinguisher, and then checking its answer with the Laplace mechanism). Privacy will therefore follow from the composition theorem.

Theorem 7.2. Given parameters $\varepsilon, \delta < 1$, the IC mechanism is $(\varepsilon, \delta)$-differentially private.

Proof. The mechanism accesses the data at most $2B(\alpha)$ times using algorithms that are $\varepsilon_0$-differentially private. By Theorem ??, the mechanism is therefore $(\varepsilon', \delta)$-differentially private for $\varepsilon' = \sqrt{4B(\alpha)\ln(1/\delta)}\,\varepsilon_0 + 2B(\alpha)\varepsilon_0(e^{\varepsilon_0} - 1)$. Plugging in our choice of $\varepsilon_0$ proves the claim. □
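The "plugging in" step can be checked numerically. The sketch below evaluates $\varepsilon_0$ from Algorithm 3 and the composed $\varepsilon'$ from the proof above; the function names and the example parameters are hypothetical.

```python
import math

def per_round_epsilon(eps, delta, num_rounds):
    """eps_0 = eps / (4 sqrt(B log(1/delta))), as set in Algorithm 3."""
    return eps / (4.0 * math.sqrt(num_rounds * math.log(1.0 / delta)))

def composed_epsilon(eps0, delta, num_rounds):
    """eps' from the proof of Theorem 7.2, for 2B accesses at privacy level eps0."""
    first = math.sqrt(4.0 * num_rounds * math.log(1.0 / delta)) * eps0
    second = 2.0 * num_rounds * eps0 * (math.exp(eps0) - 1.0)
    return first + second

e0 = per_round_epsilon(0.5, 1e-6, 100)
total = composed_epsilon(e0, 1e-6, 100)
```

The first term equals exactly $\varepsilon/2$ by the choice of $\varepsilon_0$; for these example parameters the second term is small, so `total` stays below $\varepsilon = 0.5$.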

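Theorem 7.3. Given an $(F(\varepsilon), \gamma)$-private distinguisher and a $B(\alpha)$-IDC, the Iterative Construction mechanism is $(\alpha, \beta)$-accurate for:

$$\alpha \ge \max\left[\frac{16\sqrt{B(\alpha)\log(1/\delta)}\,\log(2B(\alpha)/\beta)}{\varepsilon},\ 2F\left(\frac{\varepsilon}{4\sqrt{B(\alpha)\log(1/\delta)}}\right)\right]$$

so long as $\gamma \le \beta/(2B(\alpha))$.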
Proof. The analysis is straightforward. First we observe that because the algorithm runs for at most $B(\alpha)$ steps, except with probability at most $\beta/2$, for all $t$:

$$|\hat{A}^{(t)} - Q^{(t)}(\mathcal{D})| \le \frac{1}{\varepsilon_0}\log\frac{2B(\alpha)}{\beta} = \frac{4\sqrt{B(\alpha)\log(1/\delta)}}{\varepsilon}\log\frac{2B(\alpha)}{\beta} \le \frac{\alpha}{4}.$$

Note that by assumption, $\gamma \le \beta/(2B(\alpha))$, so we also have that except with probability $\beta/2$,

$$|Q^{(t)}(\mathcal{D}) - Q^{(t)}(D^{t-1})| \ge \max_{Q' \in \mathcal{Q}} |Q'(\mathcal{D}) - Q'(D^{t-1})| - F\left(\frac{\varepsilon}{4\sqrt{B(\alpha)\log(1/\delta)}}\right) \ge \max_{Q' \in \mathcal{Q}} |Q'(\mathcal{D}) - Q'(D^{t-1})| - \frac{\alpha}{2}.$$

For the rest of the argument, we will condition on both of these events occurring, which is the case except with probability $\beta$. There are two cases. Either a database $\mathcal{D}' = D^{B(\alpha)}$ is output, or a database $\mathcal{D}' = D^{t-1}$ for $t \le B(\alpha)$ is output. First, suppose $\mathcal{D}' = D^{B(\alpha)}$. Since for all $t$, $|\hat{A}^{(t)} - Q^{(t)}(D^{t-1})| \ge 3\alpha/4$ and, by our conditioning, $|\hat{A}^{(t)} - Q^{(t)}(\mathcal{D})| \le \alpha/4$, the sequence $(D^{t}, Q^{(t)}, \hat{A}^{(t)})$ formed a maximal $(U_{\alpha/2}, \mathcal{D}, \mathcal{Q}, \alpha/2, B(\alpha))$-Database Update Sequence. Therefore, we have that $\max_{Q \in \mathcal{Q}} |Q(\mathcal{D}) - Q(\mathcal{D}')| \le \alpha/2$ as desired. Next, suppose $\mathcal{D}' = D^{t-1}$ for $t \le B(\alpha)$. Then it must have been the case that for some $t$, $|\hat{A}^{(t)} - Q^{(t)}(D^{t-1})| \le 3\alpha/4$. By our conditioning, in this case it must be that $|Q^{(t)}(\mathcal{D}) - Q^{(t)}(D^{t-1})| \le \alpha/2$, and that therefore by the properties of an $(F(\varepsilon_0), \gamma)$-distinguisher:

$$\max_{Q \in \mathcal{Q}} |Q(\mathcal{D}) - Q(\mathcal{D}')| \le \alpha/2 + F(\varepsilon_0) \le \alpha$$

as desired. □

Note that the running time of the algorithm is dominated by the running time of the IDC algorithm and of the distinguishing algorithm: efficient IDC algorithms paired with efficient distinguishing algorithms for a class of queries $\mathcal{Q}$ automatically correspond to efficient algorithms for privately releasing synthetic data useful for $\mathcal{Q}$. For the class of graph cut queries, both the multiplicative weights IDC and the Frieze/Kannan IDC are computationally efficient. Therefore, one approach to finding a computationally efficient algorithm for releasing synthetic data useful for cut queries is to find an efficient private distinguishing algorithm for cut queries.

One curious aspect of this approach is that it might in fact be computationally easier to release a larger class of queries than cut queries, even though this is a strictly more difficult task from an information theoretic perspective. For example, solving the distinguishing problem for cut queries on graphs $\mathcal{D}$ and $\mathcal{D}'$ is equivalent to finding a pair of sets $(S, T)$ which witness the cut-norm on the graph $\mathcal{D} - \mathcal{D}'$. On the other hand, solving the distinguishing problem for rank-1 queries (which include cut queries, and are a larger class) is equivalent to finding the best rank-1 approximation to the adjacency matrix $\mathcal{D} - \mathcal{D}'$. The former problem is NP-hard, whereas the latter problem can be quickly solved non-privately using the singular value decomposition.

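Corollary 7.4. An efficient $(F(\varepsilon), \gamma)$-distinguisher for the class of rank-1 queries for $F(\varepsilon) = T/\varepsilon$ would yield an $(\alpha, \beta)$-accurate mechanism for releasing synthetic data for graph cuts (and all rank-1 queries) for any $\beta \ge \Omega(\exp(-\varepsilon T))$ and: $\alpha_{MW} \ge 2\sqrt{2}\,\varepsilon^{-1/2}\sqrt{Tm}\,(\log|V|\log(1/\delta))^{1/4}$ using the multiplicative weights IDC, or: $\alpha_{FK} \ge 2\varepsilon^{-1/2}(m\log(1/\delta))^{1/4}\sqrt{|V|T}$ using the Frieze/Kannan IDC.

Proof. The Multiplicative Weights mechanism is a $B(\alpha)$-IDC for the class of rank-1 queries on graphs with $m = |E|$ edges and $|V|$ vertices for $B(\alpha) = 2m^2\log|V|/\alpha^2$. We can set:

$$\alpha \ge F\left(\frac{\varepsilon}{4\sqrt{B(\alpha)\log(1/\delta)}}\right) = \frac{4\sqrt{2}\,Tm\sqrt{\log|V|\log(1/\delta)}}{\alpha\varepsilon}$$

which allows us to take

$$\alpha \ge 2\sqrt{2}\,\sqrt{\frac{Tm}{\varepsilon}}\left(\log|V|\log\frac{1}{\delta}\right)^{1/4}.$$

The Frieze/Kannan algorithm is a $B(\alpha)$-IDC with $B(\alpha) = m|V|^2/\alpha^2$. We can set:

$$\alpha \ge F\left(\frac{\varepsilon}{4\sqrt{B(\alpha)\log(1/\delta)}}\right) = \frac{4T\sqrt{m\log(1/\delta)}\,|V|}{\alpha\varepsilon}$$

which allows us to take:

$$\alpha \ge \frac{2(m\log(1/\delta))^{1/4}\sqrt{|V| \cdot T}}{\sqrt{\varepsilon}}.$$

□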


We remark that for the class of rank-1 queries, an efficient $(F(\varepsilon), \gamma)$-distinguisher with $F(\varepsilon) = \tilde{O}(|V|/\varepsilon)$ would be sufficient to yield an efficient algorithm for releasing synthetic data useful for cut queries, with guarantees matching those of the best known algorithms for the interactive case, as listed in Table 1. For graphs for which the size of the edge set $m \le \Omega(n^2)$, this would yield an improvement over our randomized response mechanism, which is the best mechanism currently for privately releasing synthetic data for graph cuts. We note that a distinguisher for rank-1 queries must simply give a good rank-1 approximation to the matrix $\mathcal{D} - \mathcal{D}'$. We further note that in the case of the Frieze/Kannan IDC for graph cuts, $\mathcal{D} - \mathcal{D}'$ is always a symmetric matrix (because the hypothesis at every step is simply the adjacency matrix of an undirected graph, as of course is the private database), and hence an algorithm for finding accurate rank-1 approximations merely for symmetric matrices would already yield an algorithm for releasing synthetic data for cuts! Unlike classes of queries like conjunctions, for which there are imposing barriers to privately outputting useful synthetic data [UV11, GHRU11], there are as far as we know no such barriers to improving our randomized-response based results for synthetic data for graph cuts. We leave finding such an algorithm, for privately giving low rank approximations to matrices, as an intriguing open problem.
A Other Iterative Database Construction Algorithms

In this section, we demonstrate how the median mechanism and the multiplicative weights mechanism fit into the IDC framework. These mechanisms apply to general classes of linear queries $\mathcal{Q}$.

A.1 The Median Mechanism

In this section, we show how to use the median database subroutine as an Iterative Database Construction.

Definition A.1 (Median Datastructure). A median datastructure $\mathbf{D}$ is a collection of databases $\mathbf{D} \subset \mathbb{N}^{|\mathcal{X}|}$. Any query can be evaluated on a median datastructure as follows: $Q(\mathbf{D}) = \mathrm{Median}(\{Q(\mathcal{D}') : \mathcal{D}' \in \mathbf{D}\})$.

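Algorithm 4 The Median Mechanism (MM) Algorithm.

$U^{MM}_{\alpha}(\mathbf{D}^t, Q^{(t)}, A^{(t)})$:

If: $\mathbf{D}^t = \emptyset$ then: output $\mathbf{D}^0 = \{\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|} : |\mathcal{D}| = n^2\log k/\alpha^2\}$

Else if: $Q^{(t)}(\mathbf{D}^t) - A^{(t)} > 0$ then: output $\mathbf{D}^{t+1} = \mathbf{D}^t \setminus \{\mathcal{D} \in \mathbf{D}^t : Q^{(t)}(\mathcal{D}) \ge Q^{(t)}(\mathbf{D}^t)\}$

Else if: $Q^{(t)}(\mathbf{D}^t) - A^{(t)} < 0$ then: output $\mathbf{D}^{t+1} = \mathbf{D}^t \setminus \{\mathcal{D} \in \mathbf{D}^t : Q^{(t)}(\mathcal{D}) \le Q^{(t)}(\mathbf{D}^t)\}$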



Theorem A.2. The Median Mechanism algorithm is a $B(\alpha) = n^2\log|\mathcal{X}|\log k/\alpha^2$ iterative database construction algorithm for every class of $k$ linear queries $\mathcal{Q}$.

Proof. Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be any database and let $\{(\mathbf{D}^t, Q^{(t)}, A^{(t)})\}_{t = 1, \dots, B}$ be a $(U^{MM}_{\alpha}, \mathcal{D}, \mathcal{Q}, \alpha, B)$-database update sequence. We want to show that $B(\alpha) \le n^2\log|\mathcal{X}|\log k/\alpha^2$. Specifically, that after $n^2\log|\mathcal{X}|\log k/\alpha^2$ invocations of $U^{MM}_{\alpha}$, the median datastructure $\mathbf{D}^{n^2\log|\mathcal{X}|\log k/\alpha^2}$ is $(\alpha, \mathcal{Q})$-accurate for $\mathcal{D}$. The argument is simple. First, we have a simple fact from [BLR08]:

Claim A.3. For any set of $k$ linear queries $\mathcal{Q}$ and any database $\mathcal{D}$ of size $n$, there is a database $\mathcal{D}'$ of size $|\mathcal{D}'| = n^2\log k/\alpha^2$ so that $\mathcal{D}'$ is $\alpha$-accurate for $\mathcal{D}$ with respect to $\mathcal{Q}$.

From this claim, we have that $|\mathbf{D}^t| \ge 1$ for all $t$, and so $\mathbf{D}^t$ can always be used to evaluate queries. On the other hand, each update step eliminates half of the databases in the median datastructure: $|\mathbf{D}^t| \le |\mathbf{D}^{t-1}|/2$. This is because the update step eliminates every database either above or below the median with respect to the last query. Initially $|\mathbf{D}^0| = |\mathcal{X}|^{n^2\log k/\alpha^2}$, and so there can be at most $B(\alpha) \le n^2\log|\mathcal{X}|\log k/\alpha^2$ update steps before we would have $|\mathbf{D}^t| < 1$, a contradiction. □

A.2 The Multiplicative Weights Mechanism

In this section we show how to use the multiplicative weights subroutine as an Iterative Database Construction. The analysis of the multiplicative weights algorithm is not new, and follows [HR10]. It will be convenient to think of our databases in this section as probability distributions, i.e. normalized so that $\|\mathcal{D}\|_1 = 1$. Note that if we are $\alpha/n$-accurate for the normalized database, we are $\alpha$-accurate for the un-normalized database with respect to any set of linear queries.

Theorem A.4. The Multiplicative Weights algorithm is a $B(\alpha) = 4n^2\log|\mathcal{X}|/\alpha^2$ iterative database construction algorithm for every class of linear queries $\mathcal{Q}$.
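Algorithm 5 The Multiplicative Weights (MW) Algorithm.

$U^{MW}_{\alpha}(\mathcal{D}^t, Q^{(t)}, \hat{A}^{(t)})$:
  Let $\eta \leftarrow \alpha/(2n)$.
  If: $\mathcal{D}^t = \emptyset$ then: output $\mathcal{D}^0 \in \mathbb{R}^{|\mathcal{X}|}$ such that $\mathcal{D}^0_i = 1/|\mathcal{X}|$ for all $i$.
  if $\hat{A}^{(t)} < Q^{(t)}(\mathcal{D}^t)$ then
    Let $r_t = Q^{(t)}$
  else
    Let $r_t = 1 - Q^{(t)}$
  end if
  Update: For all $i \in [|\mathcal{X}|]$ let
    $\hat{\mathcal{D}}^{t+1}_i = \exp(-\eta\, r_t(x_i)) \cdot \mathcal{D}^t_i$
    $\mathcal{D}^{t+1}_i = \dfrac{\hat{\mathcal{D}}^{t+1}_i}{\sum_{j=1}^{|\mathcal{X}|} \hat{\mathcal{D}}^{t+1}_j}$
  Output $\mathcal{D}^{t+1}$.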
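A minimal sketch of one multiplicative weights step follows, with the hypothesis held as a probability vector over the universe. The function name is hypothetical, `None` stands in for the empty input, and the noisy answer is assumed to be expressed on the normalized scale (a value in $[0,1]$), in line with the normalization convention of this section.

```python
import math

def mw_update(hyp, query, noisy_answer, alpha, n):
    """One multiplicative-weights step on a hypothesis distribution over the universe."""
    if hyp is None:
        return [1.0 / len(query)] * len(query)     # uniform starting distribution
    eta = alpha / (2.0 * n)
    q_hyp = sum(q * h for q, h in zip(query, hyp))
    # Penalize the elements that push the hypothesis answer in the wrong direction.
    penalty = list(query) if noisy_answer < q_hyp else [1.0 - q for q in query]
    weights = [math.exp(-eta * r) * h for r, h in zip(penalty, hyp)]
    total = sum(weights)
    return [w / total for w in weights]            # renormalize to a distribution
```

When the hypothesis overestimates the query, mass moves away from elements on which the query is large; when it underestimates, mass moves toward them.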



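Proof. Let $\mathcal{D} \in \mathbb{N}^{|\mathcal{X}|}$ be any database and let $\{(\mathcal{D}^t, Q^{(t)}, \hat{A}^{(t)})\}_{t = 1, \dots, B}$ be a $(U^{MW}_{\alpha}, \mathcal{D}, \mathcal{Q}, \alpha, B)$-database update sequence. We want to show that $B(\alpha) \le 4n^2\log|\mathcal{X}|/\alpha^2$. Specifically, that after $4n^2\log|\mathcal{X}|/\alpha^2$ invocations of $U^{MW}_{\alpha}$, the database $\mathcal{D}^{(4n^2\log|\mathcal{X}|/\alpha^2)}$ is $(\alpha, \mathcal{Q})$-accurate for $\mathcal{D}$. First let $\bar{\mathcal{D}} \in \mathbb{R}^{|\mathcal{X}|}$ be a normalization of the database $\mathcal{D}$: $\bar{\mathcal{D}}_i = \mathcal{D}_i/\|\mathcal{D}\|_1$. Note that for any linear query, $Q(\mathcal{D}) = n \cdot Q(\bar{\mathcal{D}})$. We define:

$$\Psi_t := \sum_{i=1}^{|\mathcal{X}|} \bar{\mathcal{D}}_i \log\left(\frac{\bar{\mathcal{D}}_i}{\mathcal{D}^t_i}\right).$$

We begin with a simple fact:

Claim A.5 ([HR10]). For all $t$: $\Psi_t \ge 0$, and $\Psi_0 \le \log|\mathcal{X}|$.

We will argue that in every step for which $|Q^{(t)}(\bar{\mathcal{D}}) - Q^{(t)}(\mathcal{D}^t)| \ge \alpha/n$ the potential drops by at least $\alpha^2/4n^2$. Because the potential begins at $\log|\mathcal{X}|$, and must always be non-negative, we know that there can be at most $B(\alpha) \le 4n^2\log|\mathcal{X}|/\alpha^2$ steps before the algorithm outputs a database $\mathcal{D}^t$ such that $\max_{Q \in \mathcal{Q}} |Q(\bar{\mathcal{D}}) - Q(\mathcal{D}^t)| < \alpha/n$, which is exactly the condition that we want.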

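Lemma A.6 ([HR10]).

$$\Psi_t - \Psi_{t+1} \ge \eta\left(r_t(\mathcal{D}^t) - r_t(\bar{\mathcal{D}})\right) - \eta^2$$

Proof.

$$
\begin{aligned}
\Psi_t - \Psi_{t+1}
&= \sum_{i=1}^{|\mathcal{X}|} \bar{\mathcal{D}}_i \log\left(\frac{\mathcal{D}^{t+1}_i}{\mathcal{D}^t_i}\right) \\
&= -\eta\, r_t(\bar{\mathcal{D}}) - \log\left(\sum_{i=1}^{|\mathcal{X}|} \exp(-\eta\, r_t(x_i))\,\mathcal{D}^t_i\right) \\
&\ge -\eta\, r_t(\bar{\mathcal{D}}) - \log\left(\sum_{i=1}^{|\mathcal{X}|} \mathcal{D}^t_i\left(1 + \eta^2 - \eta\, r_t(x_i)\right)\right) \\
&\ge \eta\left(r_t(\mathcal{D}^t) - r_t(\bar{\mathcal{D}})\right) - \eta^2.
\end{aligned}
$$

□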



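The rest of the proof now follows easily. By the conditions of an iterative database construction algorithm, $|\hat{A}^{(t)} - Q^{(t)}(\bar{\mathcal{D}})| \le \alpha/(2n)$. Hence, for each $t$ such that $|Q^{(t)}(\bar{\mathcal{D}}) - Q^{(t)}(\mathcal{D}^t)| \ge \alpha/n$, we also have that $Q^{(t)}(\bar{\mathcal{D}}) > Q^{(t)}(\mathcal{D}^t)$ if and only if $\hat{A}^{(t)} > Q^{(t)}(\mathcal{D}^t)$. In particular, $r_t = Q^{(t)}$ if $Q^{(t)}(\mathcal{D}^t) - Q^{(t)}(\bar{\mathcal{D}}) \ge \alpha/n$, and $r_t = 1 - Q^{(t)}$ if $Q^{(t)}(\bar{\mathcal{D}}) - Q^{(t)}(\mathcal{D}^t) \ge \alpha/n$. Therefore, by Lemma A.6 and the fact that $\eta = \alpha/2n$:

$$\Psi_t - \Psi_{t+1} \ge \frac{\alpha}{2n}\left(r_t(\mathcal{D}^t) - r_t(\bar{\mathcal{D}})\right) - \frac{\alpha^2}{4n^2} \ge \frac{\alpha}{2n}\left(\frac{\alpha}{n}\right) - \frac{\alpha^2}{4n^2} = \frac{\alpha^2}{4n^2}.$$

□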